Abstract

The Integrated Completed Likelihood (ICL) criterion has been pro- 
posed by Biernacki et al. (2000) in the model-based clustering frame- 
work to select a relevant number of classes and has been used by statis- 
ticians in various application areas. 

A theoretical study of this criterion is proposed. A contrast related
to the clustering objective is introduced: the conditional classification
likelihood. This yields an estimator and a class of model selection criteria.
The properties of these new procedures are studied and ICL is proved
to be an approximation of one of these criteria.

We contrast these results with the prevailing view that ICL is not
consistent. Moreover, these results give insights into the class notion
underlying ICL and feed a reflection on the class notion in clustering.

General results on penalized minimum contrast criteria and on mix- 
ture models are derived, which are interesting in their own right. 
1 Introduction



Model-based clustering is introduced in Sections 1.1 and 1.2. Our purpose 
is to better understand the behavior of the ICL model selection criterion of 
Biernacki et al. (2000), which is presented in Section 1.3. 

The main topic of this work is the choice of the number of classes in
a model-based clustering framework, and hence the choice of the number
of components of a Gaussian mixture. The interested reader may refer to
Titterington et al. (1985) or McLachlan and Peel (2000) for comprehensive
studies on Gaussian mixture models. The latter also provides an overview of
the approaches for assessing the number of components, and particularly
of the standard and widely used penalized likelihood criteria, such as AIC
(Akaike, 1973) or BIC (Schwarz, 1978).

The ICL criterion studied here is an alternative to BIC. Up to now it has
mostly been presented as a penalized likelihood criterion whose penalty
involves an "entropy" term. Here, however, we prove that it is actually a
penalized contrast criterion with a contrast which differs from the standard
likelihood: this explains why it is neither surprising nor a drawback that
ICL does not asymptotically select the "true" number of components, even
when the "true" model is considered. Even for data arising from a mixture
distribution, a relevant number of classes may differ from the true number
of components of the mixture.

The reason why we introduce this new contrast $L_{cc}$ (Section 2.1) is not
that we believe it a priori to be the best one for a clustering purpose,
but rather that it makes it possible to study and understand ICL
theoretically. We prove (Section 4.3) that ICL is an approximation of a
criterion linked to this contrast: studying ICL further then amounts to
studying $L_{cc}$. The notion of class underlying ICL is proved to be a
compromise between Gaussian mixture density estimation and a strictly
"cluster" point of view (Section 5).

Let $X$ be a random variable in $\mathbb{R}^d$ with distribution $f^\wp \cdot \lambda$ and $X_1, \dots, X_n$
an i.i.d. sample of the same distribution. Let us denote $\mathbf{X} = (X_1, \dots, X_n)$.

All proofs are gathered in Section 6. 

1.1 Gaussian Mixture Models 

$\mathcal{M}_K$ is the Gaussian mixture model with $K$ components:

$$\mathcal{M}_K = \Big\{ f(\,\cdot\,;\theta) = \sum_{k=1}^K \pi_k \phi(\,\cdot\,;\omega_k) \;:\; \theta = (\pi_1,\dots,\pi_K,\omega_1,\dots,\omega_K) \in \Theta_K \Big\},$$

where $\phi$ is the Gaussian density and $\Theta_K \subset \Pi_K \times (\mathbb{R}^d \times \mathbb{S}^d_+)^K$, with
$\Pi_K = \{(\pi_1,\dots,\pi_K) \in [0,1]^K : \sum_{k=1}^K \pi_k = 1\}$ and $\mathbb{S}^d_+$ the set of positive definite
$d \times d$ real matrices. Constraints on the model may be imposed by restricting
$\Theta_K$. We typically have in mind the decomposition suggested by Celeux
and Govaert (1995). "General" (no constraint) and "diagonal" (diagonal
covariance matrices) models will be considered here as examples.

Those are studied here as parametric models. The existence of a
parametrization $\varphi : \Theta_K \subset \mathbb{R}^{D_K} \to \mathcal{M}_K$ is then assumed. It is assumed that
$\Theta_K$ and $\varphi$ are "optimal", in the sense that $D_K$ is minimal. $D_K$ is the number
of free parameters in the model $\mathcal{M}_K$ and is called the dimension of $\mathcal{M}_K$.
For example, at most $(K - 1)$ mixing proportions need to be parametrized.

The parametrization need not be assumed identifiable, i.e. $\varphi$ need not
be injective. Indeed our purpose is twofold: identifying a relevant
number of classes to be designed; and actually designing those classes.
Theorem 4.2 justifies that the first task can be achieved under a weaker
"identifiability" assumption. Theorem 3.2 then guarantees that our estimator
converges to the set of best parameters, any of which is as good as the others.
There will be no "true parameter" assumption. The classes can finally be 
defined through the MAP rule (see Section 1.2). Practically, the parameters 
themselves are never the quantities of interest here. They only stand as 
a convenient notation and this is also why we expect that the assumption 
about the Fisher information (see Theorem 4.2) is technical and could maybe 
be avoided with other techniques. Please refer to Baudry (2009, Chapter 4) 
for a more comprehensive discussion about the identifiability question. 

1.2 Model-Based Clustering 

Although the results are stated first for much more general situations, this 
paper is devoted to the question of clustering through Gaussian mixture 
models. 

The process is standard (see Fraley and Raftery, 2002): 

• fit each considered mixture model; 

• select a model and a number of components based on the first step; 

• classify the observations through the MAP rule (recalled below) with 
respect to the mixture distribution fitted in the selected model. 

Notably, the usual choice is made here, to identify a class with each fitted
Gaussian component. The number of classes to be designed is then chosen
at the second step. See for example Hennig (2010) and Baudry et al. (2010)
for alternative approaches.

Let us recall the MAP classification rule. It involves the conditional
probabilities of the components

$$\forall \theta \in \Theta_K,\ \forall k,\ \forall x, \quad \tau_k(x;\theta) = \frac{\pi_k \phi(x;\omega_k)}{\sum_{k'=1}^K \pi_{k'} \phi(x;\omega_{k'})}.$$

$\tau_k(x;\theta)$ is the probability that $X$ arises from the $k$th component,
conditionally on $X = x$, under the distribution defined by $\theta$. Let us also denote
$\tau_{ik}(\theta) = \tau_k(X_i;\theta)$. The MAP classification rule for $x$ is then

$$\hat z^{\mathrm{MAP}}(\theta) = \operatorname*{argmax}_{k \in \{1,\dots,K\}} \tau_k(x;\theta).$$
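As an illustration, the conditional probabilities and the MAP rule can be sketched in a few lines of Python. This is only a minimal sketch restricted to univariate mixtures ($d = 1$), an assumption made here for brevity; the function names `gaussian_pdf`, `posteriors` and `map_label` are hypothetical and do not come from the original work. Component labels are 0-based in the code.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density phi(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, weights, means, variances):
    """Conditional probabilities tau_k(x; theta) of each component given X = x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # mixture density f(x; theta)
    return [j / total for j in joint]


def map_label(x, weights, means, variances):
    """MAP rule: index (0-based) of the component with the largest tau_k."""
    tau = posteriors(x, weights, means, variances)
    return max(range(len(tau)), key=tau.__getitem__)
```

For instance, with two equally weighted unit-variance components centered at -1 and 1, a point at the origin is equally likely to arise from either component, whereas a point at 2 is assigned to the second one.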

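Let us denote by $L$ the observed likelihood associated to $\mathbf{X}$:

$$\forall \theta \in \Theta_K, \quad L(\theta;\mathbf{X}) = \prod_{i=1}^n \sum_{k=1}^K \pi_k \phi(X_i;\omega_k).$$

The maximum likelihood estimator in the model $\mathcal{M}_K$ is denoted by $\hat\theta_K^{\mathrm{MLE}}$.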



1.3 ICL 

Our motivation is to better understand the ICL (Integrated Completed
Likelihood) criterion. Let us introduce the classification likelihood associated to
the complete data sample $(\mathbf{X}, \mathbf{Z})$ ($Z \in \{0, 1\}^K$ is the unobserved label of $X$:
$Z_k = 1 \Leftrightarrow X$ arises from component $k$):

$$\forall \theta \in \Theta_K, \quad L_c(\theta;(\mathbf{X},\mathbf{Z})) = \prod_{i=1}^n \prod_{k=1}^K \big(\pi_k \phi(X_i;\omega_k)\big)^{Z_{ik}}. \qquad (1)$$

To mimic the derivation of the BIC criterion (Schwarz, 1978) in a
clustering framework, Biernacki et al. (2000) approximate the integrated
classification likelihood through a Laplace approximation. Then they assume
that the classification likelihood mode can be identified with $\hat\theta_K^{\mathrm{MLE}}$ when $n$ is
large enough and replace the unobserved $Z_{ik}$'s by their MAP estimators
under $\hat\theta_K^{\mathrm{MLE}}$. This is questionable, notably when the components of $\hat\theta_K^{\mathrm{MLE}}$ are
not well separated. They derive the ICL criterion:

$$\mathrm{crit}_{\mathrm{ICL}}(K) = \log L(\hat\theta_K^{\mathrm{MLE}}) + \sum_{i=1}^n \sum_{k=1}^K \hat Z^{\mathrm{MAP}}_{ik}(\hat\theta_K^{\mathrm{MLE}}) \log \tau_{ik}(\hat\theta_K^{\mathrm{MLE}}) - \frac{\log n}{2} D_K.$$
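McLachlan and Peel (2000) replace the $Z_{ik}$'s by their conditional
expectations $\tau_{ik}(\hat\theta_K^{\mathrm{MLE}})$:

$$\mathrm{crit}_{\mathrm{ICL}}(K) = \log L(\hat\theta_K^{\mathrm{MLE}}) + \sum_{i=1}^n \sum_{k=1}^K \tau_{ik}(\hat\theta_K^{\mathrm{MLE}}) \log \tau_{ik}(\hat\theta_K^{\mathrm{MLE}}) - \frac{\log n}{2} D_K. \qquad (2)$$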

Both versions of the ICL appear to behave analogously, and the latter is 
considered from now on. 

The ICL differs from the standard and widely used BIC criterion of
Schwarz (1978) through the entropy term (see Section 2.2):

$$\forall \theta \in \Theta_K, \quad \mathrm{ENT}(\theta;\mathbf{X}) = -\sum_{i=1}^n \sum_{k=1}^K \tau_{ik}(\theta) \log \tau_{ik}(\theta). \qquad (3)$$
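To make the relation between the two criteria concrete, the following Python sketch evaluates the entropy (3), BIC and the ICL version (2) for a given parameter. It assumes univariate components ($d = 1$), strictly positive proportions, and that the parameter passed in has already been fitted (for instance by maximum likelihood, which is not shown); all function names are hypothetical.

```python
import math


def log_joint(x, weights, means, variances):
    """log(pi_k * phi(x; omega_k)) for each component k (univariate case)."""
    return [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        for w, m, v in zip(weights, means, variances)
    ]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_likelihood_and_entropy(sample, weights, means, variances):
    """Return (log L(theta; X), ENT(theta; X))."""
    log_lik, ent = 0.0, 0.0
    for x in sample:
        lj = log_joint(x, weights, means, variances)
        log_f = log_sum_exp(lj)  # log f(x; theta)
        log_lik += log_f
        for v in lj:
            log_tau = v - log_f  # log tau_ik(theta)
            ent -= math.exp(log_tau) * log_tau
    return log_lik, ent


def bic(sample, weights, means, variances, dim):
    """log L - (log n / 2) * D_K, where dim plays the role of D_K."""
    log_lik, _ = log_likelihood_and_entropy(sample, weights, means, variances)
    return log_lik - 0.5 * math.log(len(sample)) * dim


def icl(sample, weights, means, variances, dim):
    """Criterion (2): log L - ENT - (log n / 2) * D_K."""
    log_lik, ent = log_likelihood_and_entropy(sample, weights, means, variances)
    return log_lik - ent - 0.5 * math.log(len(sample)) * dim
```

With a single component the entropy vanishes and both criteria coincide; with overlapping components the entropy is positive, so the ICL value is strictly below the BIC value.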

The BIC is known to be consistent, in the sense that it asymptotically
selects the true number of components, at least when the true distribution
actually lies in one of the considered models (Keribin, 2000; Nishii, 1988).
This nice property may however not suit a clustering purpose. In many
applications, there is no reason to assume that the distribution conditional
on the (unobserved) labels $Z$ is Gaussian. The BIC in this case tends to
overestimate the number of components since several Gaussian components
are needed to approximate each non-Gaussian component of the true
mixture distribution $f^\wp$. The user may rather be interested in a cluster
notion (as opposed to this strictly component-based approach) which also
includes a separation notion and which is robust to non-Gaussian
components. Of course, it depends on the application, and on what a class should
be. It may be of interest to split into two different classes a group
of observations for which the best fit is reached with a mixture of two Gaussian
components having quite different parameters (we particularly think of the
covariance matrix parameters). BIC tends to do so. But it may also be
more relevant, and may conform to an intuitive notion of cluster, to identify
two very close (or largely overlapping) Gaussian components as a single
non-Gaussian shaped cluster (see for example Figure 3).

ICL has been derived with this viewpoint. It is widely understood and
explained (for instance in Biernacki et al., 2000) as the BIC criterion with
a supplemental penalty, which is the entropy (Section 2.2). Since the latter
penalizes models whose maximum likelihood estimator yields an uncertain
MAP classification, ICL is more robust than BIC to non-Gaussian
components. However we do not think that the entropy should be considered as a
penalty term, and another point of view will be developed in this paper.
The references here were found by browsing the results obtained from
Google Scholar citations of Biernacki et al. (2000). Only 3 pages of 16
have been studied. The behavior of ICL has been studied through
simulations and real data studies by Biernacki et al. (2000), McLachlan and Peel
(2000, Section 6.11), Steele and Raftery (2010) and in several simulation
studies (see Baudry, 2009, Chapter 4). Besides, several authors chose to use
it for the mentioned reasons in various application areas: Goutte et al. (2001)
(fMRI images); Pigeau and Gelgon (2005) (automatic sorting of image
collections); Hamelryck et al. (2006) (protein structure prediction); De Granville
et al. (2006) (robot learning); Mariadassou et al. (2010) (uncovering groups
of nodes in valued graphs and application to host-parasite interaction
networks in forest ecosystems analysis); Rigaill et al. (2012) (comparative
genomic hybridization profiles); etc.

This practical interest in ICL suggests that it meets an interesting
notion of cluster, corresponding to what some users expect. But no
theoretical study is available. Our main motivation is to go further in this
direction. This leads to considering new estimation and model selection
procedures for clustering, similar to ICL but for which the underlying logic
is carried through to its conclusion, from the estimation step to the
model selection step, instead of introducing the MLE. It is proved that ICL
is an approximation of a criterion which is consistent for a particular loss
function.

2 A New Contrast: Conditional Classification Likelihood

The contrast minimization framework turns out to be a fruitful approach. It
makes it possible to fully understand that ICL is not a penalized likelihood
criterion, as opposed to the usual point of view. It should rather be linked
to another contrast: the conditional classification likelihood.

2.1 Definition, Origin 

In a clustering context, the classification likelihood (see (1)) is an
interesting quantity, but the labels $Z$ are not observed, nor do we assume that they
even exist (think of the case where several models with different numbers of
components are fitted: then at most one can correspond to the true number of
classes). Besides the above-mentioned work of Biernacki et al. (2000),
Biernacki and Govaert (1997), for example, already proposed to directly involve
the classification likelihood to select the number of classes, by estimating the
unobserved data. We propose here to consider its expectation conditional on
the observed sample $\mathbf{X}$. In case there exists a true classification and a model
with the true number of classes is considered, this conditional expectation
can be interpreted as the quantity closest to the classification likelihood
which can be considered given the available information.

Let us report the following algebraic relation between $L$ and $L_c$:

$$\forall \theta \in \Theta_K, \quad \log L_c(\theta) = \log L(\theta) + \sum_{i=1}^n \sum_{k=1}^K Z_{ik} \log \tau_{ik}(\theta). \qquad (4)$$

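Then, denoting the conditional expectation of $\log L_c(\theta)$ by $\log L_{cc}(\theta)$ (for
Conditional Classification log Likelihood),

$$\log L_{cc}(\theta) = \mathbb{E}_\theta\big[\log L_c(\theta) \,\big|\, \mathbf{X}\big] = \log L(\theta) + \underbrace{\sum_{i=1}^n \sum_{k=1}^K \tau_{ik}(\theta) \log \tau_{ik}(\theta)}_{-\mathrm{ENT}(\theta;\mathbf{X})},$$

which is obviously linked to the clustering objective. The second equality
follows from (4) since $\mathbb{E}_\theta[Z_{ik} \mid \mathbf{X}] = \tau_{ik}(\theta)$. In the following, we consider
$-\log L_{cc}$ as an empirical contrast to be minimized.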
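The identity above can be checked numerically. The sketch below computes $\log L_{cc}$ in two ways for a univariate mixture (an assumption made for brevity): as $\log L - \mathrm{ENT}$, and directly as the conditional expectation $\sum_i \sum_k \tau_{ik} \log(\pi_k \phi(X_i;\omega_k))$ of $\log L_c$. Function names are hypothetical and proportions are assumed strictly positive.

```python
import math


def log_joint(x, weights, means, variances):
    """log(pi_k * phi(x; omega_k)) for each component k (univariate case)."""
    return [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        for w, m, v in zip(weights, means, variances)
    ]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_lcc_via_entropy(sample, weights, means, variances):
    """log L(theta) + sum_i sum_k tau_ik log tau_ik, i.e. log L - ENT."""
    total = 0.0
    for x in sample:
        lj = log_joint(x, weights, means, variances)
        log_f = log_sum_exp(lj)
        total += log_f + sum(math.exp(v - log_f) * (v - log_f) for v in lj)
    return total


def log_lcc_via_expectation(sample, weights, means, variances):
    """E_theta[log L_c | X]: each Z_ik in log L_c is replaced by tau_ik."""
    total = 0.0
    for x in sample:
        lj = log_joint(x, weights, means, variances)
        log_f = log_sum_exp(lj)
        total += sum(math.exp(v - log_f) * v for v in lj)
    return total
```

Both functions agree up to rounding error, because $\log(\pi_k \phi(x;\omega_k)) = \log f(x;\theta) + \log \tau_k(x;\theta)$ and the $\tau_k$ sum to one.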

2.2 Entropy 

$\log L_{cc}$ differs from $\log L$ through the entropy (see (3)).

The behavior of the entropy is based on the properties of the function
$h : t \in [0, 1] \mapsto -t \log t$ (with $h(0) = 0$). This nonnegative function (see
Figure 1) takes zero value if and only if $t = 0$ or $t = 1$. It is continuous but
not differentiable at 0, and in particular it is not Lipschitz over $[0, 1]$, which
will be a cause of difficulties in the analysis. Let us also introduce the function
$h_K : (t_1,\dots,t_K) \in \Pi_K \mapsto \sum_{k=1}^K h(t_k)$. This nonnegative function (see
Figure 2) then takes zero value if and only if there exists $k_0 \in \{1,\dots,K\}$
such that $t_{k_0} = 1$ and $t_k = 0$ for $k \neq k_0$. It reaches its maximum value $\log K$
at $(t_1,\dots,t_K) = (\frac{1}{K},\dots,\frac{1}{K})$ (proof in Section 6).
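These properties are easy to check numerically. The following short Python sketch (function names `h` and `h_K` mirror the notation above; the code itself is only illustrative) evaluates both functions.

```python
import math


def h(t):
    """h(t) = -t log t on [0, 1], extended by continuity with h(0) = 0."""
    return 0.0 if t == 0.0 else -t * math.log(t)


def h_K(probs):
    """Entropy of a probability vector (t_1, ..., t_K) in the simplex."""
    return sum(h(t) for t in probs)
```

For example, `h_K([0.25] * 4)` equals $\log 4$, `h_K([0.0, 1.0, 0.0])` is zero, and the ratio `h(t) / t`, which equals $-\log t$, grows without bound as `t` decreases to 0, which is why $h$ is not Lipschitz near 0.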

Now, the contribution $\mathrm{ENT}(\theta; x_i)$ of a single observation to the total
entropy $\mathrm{ENT}(\theta; \mathbf{x})$ is considered. Figure 3 represents a dataset simulated
from a four-component Gaussian mixture. Let $\theta$ be such that $f(\,\cdot\,;\theta) = f^\wp$.
First, $\mathrm{ENT}(\theta; x_i) \approx 0$ if and only if there exists $k_0$ such that $\tau_{ik_0} \approx 1$ and
$\tau_{ik} \approx 0$ for $k \neq k_0$. There is no difficulty in classifying $x_i$ in such a case (for
example $x_{i_1}$). Second, $\mathrm{ENT}(\theta; x_i)$ is all the greater as $(\tau_{i1},\dots,\tau_{iK})$ is
closer to $(\frac{1}{K},\dots,\frac{1}{K})$, i.e. as the classification through the MAP rule is
more uncertain. The worst case is reached when the conditional distribution over
the components $1,\dots,K$ is uniform. The observation $x_{i_2}$ for example
has about the same posterior probability $\frac{1}{2}$ to arise from each one of the
components surrounding it. Its individual entropy is about $\log 2$.

In conclusion the individual entropy is a measure of the assignment
confidence of the considered observation through the MAP classification rule.
The total entropy $\mathrm{ENT}(\theta; \mathbf{x})$ is the empirical mean assignment confidence,
and thus measures the MAP classification quality for the whole sample.



Figure 3: A dataset example



Involving this quantity in a clustering study means that one expects
the classification to be confident. The class notion underlying the choice of
the conditional classification likelihood as a contrast is then a compromise
between the fit (and thus the idea of Gaussian-shaped classes), because of
the likelihood term, on the one hand, and the assignment confidence, because
of the entropy term, on the other hand (which is rather a cluster point of
view).

2.3 $\log L_{cc}$ as a Contrast

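See for example Massart (2007) for an introduction to contrast
minimization. Let us consider the best distribution from the $L_{cc}$ point of view in a
model $\mathcal{M}_m = \{f(\,\cdot\,;\theta) : \theta \in \Theta_m\}$, namely the distribution minimizing the
corresponding loss function

$$\theta_m^0 \in \operatorname*{argmin}_{\theta \in \Theta_m} \Big\{ d_{\mathrm{KL}}\big(f^\wp, f(\,\cdot\,;\theta)\big) + \mathbb{E}_{f^\wp}\big[\mathrm{ENT}(\theta;X)\big] \Big\} = \operatorname*{argmin}_{\theta \in \Theta_m} \mathbb{E}_{f^\wp}\big[-\log L_{cc}(\theta)\big].$$

This set is denoted by $\Theta_m^0$.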

The existence of $\mathbb{E}_{f^\wp}[-\log L_{cc}(\theta)]$ is a very mild assumption. The
non-emptiness of $\Theta_m^0$ may be guaranteed for example by assuming $\Theta_m$ to be
compact. Let $K$ be fixed and consider the minimization of the loss
function at hand in the model $\mathcal{M}_K$ (Section 1.1). First of all, remark that
$\log L_{cc} = \log L$ if $K = 1$: $\Theta_1^0$ is the set of parameters of the distributions
which minimize the Kullback-Leibler divergence to $f^\wp$. Now, if $K > 1$,
$\theta_K^0 \in \Theta_K^0$ may be close to minimizing the Kullback-Leibler divergence if the
corresponding components do not overlap since then the entropy is about
zero. But if those components overlap, this is not the case anymore
(Example 2.1).

To completely define the loss function, and to fully understand this
framework, it is necessary to consider the best element of the universe $\mathcal{U}$:

$$\operatorname*{argmin}_{f(\,\cdot\,;\theta) \in \mathcal{U}} \mathbb{E}_{f^\wp}\big[-\log L_{cc}(\theta)\big].$$

The universe $\mathcal{U}$ must be chosen with care. There is no natural relevant
choice, contrary to the density estimation framework where the set
of all densities may be chosen. First, the considered contrast is well-defined in
a parametric mixture setup, and not necessarily over any set of mixture
densities, because the definition of the entropy term involves the definition of
each component. However, this would still make it possible to consider mixtures much
more general than mixtures of Gaussian components. The ideas developed
in Baudry et al. (2010) may for example suggest involving mixtures whose
components are Gaussian mixtures. But this would not make sense. The
mixture with one component which is a mixture of $K$ Gaussian components,
and which then yields a single non-Gaussian-shaped class, always has a
smaller $-\log L_{cc}$ value than the corresponding Gaussian mixture yielding $K$
classes. This illustrates how carefully the components involved in the study
must be chosen: involving for example any mixture of Gaussian mixtures
means that one considers that a class may be almost anything and may
notably contain two Gaussian-shaped clusters very far from each other! The
components should in any case be chosen with respect to the corresponding
cluster shape. The most natural choice is then to involve in the universe only
Gaussian mixtures: $\mathcal{U}$ may be chosen as $\bigcup_{1 \le K \le K_M} \mathcal{M}_K$.

Example 2.1. $f^\wp$ is the normal density $\mathcal{N}(0, 1)$ ($d = 1$). The model
$\mathcal{M}_2 = \{\frac{1}{2}\phi(\,\cdot\,; -\mu, \sigma^2) + \frac{1}{2}\phi(\,\cdot\,; \mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}$ is considered.

Let us consider $\Theta_2^0$ in this most simple situation. We numerically obtain
that $\Theta_2^0 = \{(-\mu_0, \sigma_0^2), (\mu_0, \sigma_0^2)\}$, so that, up to a label switch, there exists a
unique minimizer of $\mathbb{E}_{f^\wp}[-\log L_{cc}(\mu, \sigma^2)]$ in $\Theta_2$ in this case (see Figure 4),
with $\mu_0 \approx 0.83$ and $\sigma_0^2 \approx 0.31$. This solution is obviously not the same
as the one minimizing the Kullback-Leibler divergence (see Figure 5). This
illustrates that the objective with the $-\log L_{cc}$ contrast is not to recover the
true distribution, even when it is available in the considered model.

The necessity of choosing a relevant model is striking in this example:
this two-component model should obviously not be used for a clustering
purpose, at least for datasets of large enough size.
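The loss function of Example 2.1 can be explored numerically. The sketch below approximates $\mathbb{E}_{f^\wp}[-\log L_{cc}(\mu, \sigma^2)]$ for a single observation by the trapezoidal rule on a truncated grid; the truncation interval, the number of grid points and the function names are assumptions made for illustration, not the numerical method used in the original study.

```python
import math


def expected_contrast(mu, var, half_width=10.0, steps=4000):
    """Approximate E_{N(0,1)}[-log L_cc(mu, var)] for one observation,
    for the model 0.5 * N(-mu, var) + 0.5 * N(mu, var)."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        x = -half_width + j * dx
        # log(pi_k * phi(x; omega_k)) for the two components
        lj = [
            math.log(0.5) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - m) ** 2 / var
            for m in (-mu, mu)
        ]
        top = max(lj)
        log_f = top + math.log(sum(math.exp(v - top) for v in lj))
        ent = -sum(math.exp(v - log_f) * (v - log_f) for v in lj)
        true_density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoidal end points
        total += weight * true_density * (-log_f + ent) * dx
    return total


def grid_minimizer(mus, variances):
    """Return the (mu, var) pair of the grid with the smallest expected contrast."""
    return min((expected_contrast(m, v), m, v) for m in mus for v in variances)[1:]
```

At $(\mu, \sigma^2) = (0, 1)$ the mixture coincides with the true density, so the Kullback-Leibler term vanishes, but both components are identical and each observation carries the maximal entropy $\log 2$; this is why the true distribution is not the minimizer of this loss. A finer grid passed to `grid_minimizer` can be used to explore the neighborhood of the values reported in the example.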




Figure 4: $\mathbb{E}_{f^\wp}[\log L_{cc}(\mu, \sigma^2)]$ w.r.t. $\mu$ and $\sigma$, for Example 2.1

Figure 5: $\log f^\wp$ (red, which is also $\log f(\,\cdot\,;\theta_2^{\mathrm{KL}})$) and $\log f(\,\cdot\,;\theta_2^0)$ (blue) for Example 2.1

The estimator associated to the $-\log L_{cc}$ contrast is now considered.
3 Estimation: MLccE



Let us fix the number of components $K$ and the model $\mathcal{M}_K$. The subscript
$K$ is omitted in the notation of this section. A new minimum contrast
estimator is considered. Results are stated in a general parametric model
setting with a general contrast $\gamma$ and a model $\mathcal{M}$ with parameter space
$\Theta \subset \mathbb{R}^D$, and then the conditions they involve are discussed in our framework.
General conditions ensuring the consistency of such an estimator are given
in Theorem 3.1. They notably involve the Glivenko-Cantelli property of the
class of functions $\{\gamma(\theta) : \theta \in \Theta\}$. Sufficient conditions in terms of bracketing
entropy for this property to hold are recalled and verified in the considered
context in Section 3.2. Those results are also useful in the study of the model
selection step (Section 4). Brought together, they provide the consistency
of the estimator in Gaussian mixture models: this is Theorem 3.2.

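Here and hereafter, all expectations $\mathbb{E}$ and probabilities $\mathbb{P}$ are taken with
respect to $f^\wp \cdot \lambda$. For a general contrast $\gamma$, we write its empirical version:
$\gamma_n(\theta) = \frac{1}{n} \sum_{i=1}^n \gamma(\theta; X_i)$. $\mathbb{R}^D$ is equipped with the infinite norm:
$\forall \theta \in \mathbb{R}^D$, $\|\theta\|_\infty = \max_{1 \le i \le D} |\theta_i|$. For any $r \in \mathbb{N}^* \cup \{\infty\}$ and for any $g : \mathbb{R}^d \to \mathbb{R}$,
$\|g\|_r = \mathbb{E}_{f^\wp}\big[|g(X)|^r\big]^{1/r}$ if $r < \infty$ and $\|g\|_\infty = \operatorname{ess\,sup}_{X \sim f^\wp} |g(X)|$ (recall that
$\operatorname{ess\,sup}_{Z \sim P} Z = \inf\{z : P[Z \le z] = 1\}$ and thus $\|g\|_\infty \le \sup_{x \in \operatorname{supp} f^\wp} |g(x)|$).
For any linear form $l : \mathbb{R}^D \to \mathbb{R}$, $\|l\|_\infty = \max_{\|\theta\|_\infty = 1} l(\theta)$.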

3.1 Definition, Consistency 

The minimum contrast estimator is named MLccE (Maximum conditional
classification Likelihood Estimator):

$$\hat\theta^{\mathrm{MLccE}} \in \operatorname*{argmin}_{\theta \in \Theta} \; -\log L_{cc}(\theta).$$
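As a rough illustration of this definition, the sketch below minimizes the empirical contrast $-\log L_{cc}$ over a finite grid for the two-component model of Example 2.1. No optimization algorithm is specified at this point of the text, so the grid search is an assumption made purely for illustration; the finite grid, with variances bounded away from zero, plays the role of a compact parameter space. Function names are hypothetical.

```python
import math
import random


def neg_log_lcc(sample, mu, var):
    """Empirical contrast -log L_cc for the model 0.5*N(-mu, var) + 0.5*N(mu, var)."""
    total = 0.0
    for x in sample:
        lj = [
            math.log(0.5) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - m) ** 2 / var
            for m in (-mu, mu)
        ]
        top = max(lj)
        log_f = top + math.log(sum(math.exp(v - top) for v in lj))
        # -sum_k tau_k * log(pi_k * phi_k), with tau_k = exp(lj_k - log_f)
        total -= sum(math.exp(v - log_f) * v for v in lj)
    return total


def mlcce_on_grid(sample, mus, variances):
    """Grid point (mu, var) minimizing the empirical contrast."""
    return min((neg_log_lcc(sample, m, v), m, v) for m in mus for v in variances)[1:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    data = [rng.gauss(0.0, 1.0) for _ in range(500)]
    grid_mu = [0.1 * j for j in range(0, 16)]
    grid_var = [0.1 * j for j in range(1, 16)]
    print(mlcce_on_grid(data, grid_mu, grid_var))
```

A continuous optimizer could replace the grid search; the grid only makes the role of the bounds on the means and variances explicit.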

To ensure its existence, we assume that $\Theta$ is compact. This is a heavy
assumption, but it will be natural and necessary for the following results to
hold. That the covariance matrices are bounded from below is a reasonable
and necessary assumption in the Gaussian mixture framework: without this
assumption, neither the log likelihood nor the conditional classification
likelihood would be bounded (for $K \ge 2$). Insights to choose lower bounds on
the proportions and the covariance matrices are suggested in Baudry (2009,
Section 5.1). The upper bound on the covariance matrices and the
compactness condition on the means, although not necessary in the standard
likelihood framework, do not seem to be avoidable here (see Section 3.2).
This is a consequence of the behavior of the entropy term as a component
goes to zero.

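The following theorem, which is directly adapted from van der Vaart
(1998, Section 5.2), gives sufficient conditions for the consistency of a
minimum contrast estimator $\hat\theta$. We write $\forall \theta \in \Theta$, $\forall \tilde\Theta \subset \Theta$, $d(\theta, \tilde\Theta) = \inf_{\tilde\theta \in \tilde\Theta} \|\theta - \tilde\theta\|_\infty$.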

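Theorem 3.1. Let $\Theta \subset \mathbb{R}^D$ and $\gamma : \Theta \times \mathbb{R}^d \to \mathbb{R}$. Assume:

$$\exists \theta^0 \in \Theta \text{ such that } \mathbb{E}_{f^\wp}[\gamma(\theta^0)] = \min_{\theta \in \Theta} \mathbb{E}_{f^\wp}[\gamma(\theta)] \quad (\Leftrightarrow \Theta^0 \text{ is not empty}) \qquad (A1)$$

$$\forall \varepsilon > 0, \quad \inf_{\{\theta : d(\theta, \Theta^0) > \varepsilon\}} \mathbb{E}_{f^\wp}[\gamma(\theta)] > \mathbb{E}_{f^\wp}[\gamma(\theta^0)] \qquad (A2)$$

$$\sup_{\theta \in \Theta} \big| \gamma_n(\theta) - \mathbb{E}_{f^\wp}[\gamma(\theta)] \big| \xrightarrow[n \to \infty]{\mathbb{P}} 0 \qquad (A3)$$

Define $\forall n$, $\hat\theta = \hat\theta(X_1, \dots, X_n) \in \Theta$ such that $\gamma_n(\hat\theta) \le \gamma_n(\theta^0) + o_{\mathbb{P}}(1)$.
Then $d(\hat\theta, \Theta^0) \xrightarrow[n \to \infty]{\mathbb{P}} 0$.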

The strong consistency holds if (A3) is replaced by an almost sure
convergence (this is the case under the conditions we are to define) and if the
inequality in the definition of $\hat\theta$ holds almost surely.

Assumption (A1) is the least that can be expected. It is guaranteed if
the parameter space is compact.

Assumption (A2) holds, too, under this compactness assumption: since
$\theta \in \Theta_K \mapsto \mathbb{E}_{f^\wp}[\gamma(\theta)]$ reaches its minimum value on the compact set
$\Theta_K \setminus \{\theta \in \Theta_K : d(\theta, \Theta_K^0) < \varepsilon\}$, this minimum is necessarily strictly greater than $\mathbb{E}_{f^\wp}[\gamma(\theta^0)]$.

Assumption (A3) is a bit strong but it will be guaranteed under the
compactness assumption through bracketing entropy arguments in Section 3.2.

Sketch of proof. The assumptions guarantee a convenient situation. With
great probability as $n$ grows, from (A3), $\gamma_n(\theta)$ is uniformly close to $\mathbb{E}_{f^\wp}[\gamma(\theta)]$:
this holds for $\hat\theta$ and $\theta^0$. Then, from the definition of $\hat\theta$, $\mathbb{E}_{f^\wp}[\gamma(\hat\theta)]$ cannot
be much larger than $\mathbb{E}_{f^\wp}[\gamma(\theta^0)]$, which reaches the minimal value. By (A2),
this implies that $\hat\theta$ cannot be far from $\Theta^0$.

□

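Let us apply Theorem 3.1 to Gaussian mixtures, with $\gamma = -\log L_{cc}$ and
$\Theta = \Theta_K$. The two following hypotheses will be involved:

$$\|M\|_r < \infty \text{ with } M(x) = \sup_{\theta \in \Theta} |\gamma(\theta; x)| < \infty \quad f^\wp d\lambda\text{-a.e.} \qquad (H^{M}_{\gamma,\Theta,r})$$

$$\|M'\|_r < \infty \text{ with } M'(x) = \sup_{\theta \in \Theta} \Big\| \frac{\partial \gamma}{\partial \theta}(\theta; x) \Big\|_\infty < \infty \quad f^\wp d\lambda\text{-a.e.} \qquad (H^{M'}_{\gamma,\Theta,r})$$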
Theorem 3.2 (Weak Consistency of MLccE, compact case). Let $\mathcal{M}$ be
a Gaussian mixture model with compact parameter space $\Theta \subset \mathbb{R}^D$. Let
$\Theta^0 = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{f^\wp}[-\log L_{cc}(\theta)]$. Let $\Theta^O \subset \mathbb{R}^D$ be an open set over which $\log L_{cc}$ is
defined, such that $\Theta \subset \Theta^O$, and assume that $H^{M'}_{\gamma,\Theta^O,1}$ holds.

Let $\hat\theta^{\mathrm{MLccE}} \in \Theta$ be an estimator (almost) maximizing $\log L_{cc}$:

$$\forall \theta^0 \in \Theta^0,\ \forall n \in \mathbb{N}^*, \quad -\log L_{cc}(\hat\theta^{\mathrm{MLccE}}) \le -\log L_{cc}(\theta^0) + o_{\mathbb{P}}(n).$$

Then $d(\hat\theta^{\mathrm{MLccE}}, \Theta^0) \xrightarrow[n \to \infty]{\mathbb{P}} 0$.

The assumption $H^{M'}_{\gamma,\Theta^O,1}$ stems from Lemma 3.2 and shall be discussed in Section 3.2.

Under the compactness assumption, $\hat\theta^{\mathrm{MLccE}}$ is then consistent. It is even
strongly consistent if it minimizes the empirical contrast almost surely. Let
us highlight that it then converges to the set of parameters minimizing the
loss function, which has no reason to contain the true distribution (except
for $K = 1$) even if the latter lies in $\mathcal{M}$.



3.2 Bracketing Entropy and Glivenko-Cantelli Property 

Recall that a class of functions over $\mathbb{R}^d$ is $P$-Glivenko-Cantelli, with $P$ a
probability measure over $\mathbb{R}^d$, if it fulfills a uniform law of large numbers for
the distribution $P$. A sufficient condition for a family $\mathcal{G}$ to be $P$-Glivenko-Cantelli
is that it is not too complex, which can be measured through its
entropy with bracketing:

Definition 3.1 ($L_r(P)$-entropy with bracketing). Let $r \in \mathbb{N}^*$ and $l, u \in L_r(P)$.
The bracket $[l, u]$ is the set of all functions $g \in \mathcal{G}$ with $l \le g \le u$. $[l, u]$
is an $\varepsilon$-bracket if $\|l - u\|_r \le \varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, \mathcal{G}, L_r(P))$ is the
minimum number of $\varepsilon$-brackets needed to cover $\mathcal{G}$. The entropy with bracketing
$\mathcal{E}_{[\,]}(\varepsilon, \mathcal{G}, L_r(P))$ of $\mathcal{G}$ with respect to $P$ is the logarithm of $N_{[\,]}(\varepsilon, \mathcal{G}, L_r(P))$.

It is quite natural that the behavior of all functions lying inside an
$\varepsilon$-bracket can be uniformly controlled by the behavior of the endpoints of the
bracket. If those endpoints belong to $L_1(P)$, they fulfill a law of large
numbers, and if the number of brackets needed to cover $\mathcal{G}$ is finite, then it is no
surprise that $\mathcal{G}$ can be proved to fulfill a uniform law of large numbers:

Theorem 3.3. Every class $\mathcal{G}$ of measurable functions such that
$\mathcal{E}_{[\,]}(\varepsilon, \mathcal{G}, L_1(P)) < \infty$ for every $\varepsilon > 0$ is $P$-Glivenko-Cantelli.

The reader is referred to van der Vaart (1998, Chapter 19) for accurate
definitions and a proof of this result. This is a generalization of the
usual Glivenko-Cantelli theorem. We shall prove that the class of functions
$\{\gamma(\,\cdot\,;\theta) : \theta \in \Theta_K\}$ has finite $\varepsilon$-bracketing entropy for any $\varepsilon > 0$, and the
assumption (A3) will be ensured.

From now on, since $\Theta$ is typically assumed to be compact, it is assumed
that $\Theta \subset \Theta^O \subset \mathbb{R}^D$ with $\Theta^O$ open, over which $\gamma$ is defined and $C^1$ for $f^\wp d\lambda$-almost
all $x$. This is no problem for Gaussian mixture models with $\log L_{cc}$
(or the standard likelihood by the way), for example with the general or
diagonal model. But this requires (with the $\log L_{cc}$ contrast) the proportions
to be positive. Actually, this could be avoided here, but we will need this
assumption for the definition of $M'$ (Hypothesis $H^{M'}_{\gamma,\Theta,r}$). As already
mentioned, components going to zero must be avoided. For the same technical
reason, we have to assume the mean parameters to be bounded.

Lemma 3.1 guarantees that the bracketing entropy of $\{\gamma(\,\cdot\,;\theta) : \theta \in \Theta\}$
is finite for any $\varepsilon$, if $\Theta$ is convex and bounded. The assumption about the
differential of the contrast is not a difficulty in our framework, provided
that non-zero lower bounds on the proportions and the covariance
matrices are imposed. The lemma is written for any $\tilde\Theta$ bounded and included
in $\Theta$ (which is not assumed to be bounded itself) since it will be applied
locally around $\theta^0$ in Section 4.

For any bounded $\tilde\Theta \subset \mathbb{R}^D$, $\operatorname{diam} \tilde\Theta = \sup\{\|\theta_1 - \theta_2\|_\infty : \theta_1, \theta_2 \in \tilde\Theta\}$.

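Lemma 3.1 (Bracketing Entropy, Convex Case). Let $r \in \mathbb{N}^*$, $D \in \mathbb{N}^*$ and
$\Theta \subset \mathbb{R}^D$ assumed to be convex. Let $\Theta^O \subset \mathbb{R}^D$ open such that $\Theta \subset \Theta^O$ and
$\gamma : \Theta^O \times \mathbb{R}^d \to \mathbb{R}$. $\theta \in \Theta^O \mapsto \gamma(\theta; x)$ is assumed to be $C^1$ for $f^\wp d\lambda$-almost
all $x$. Assume that $H^{M'}_{\gamma,\Theta,r}$ holds. Then

$$\forall \tilde\Theta \subset \Theta,\ \forall \varepsilon > 0, \quad N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \tilde\Theta\}, \|\cdot\|_r\big) \le \Big( \frac{\|M'\|_r \operatorname{diam} \tilde\Theta}{\varepsilon} \Big)^D \vee 1.$$

Remark that $\Theta$ does not have to be compact. The proof is a calculation
which relies on the mean value theorem, hence the convexity assumption.
The natural parameter space of diagonal Gaussian mixture models, with
equal volumes (if $d > 1$) or not, for instance, is convex (see Examples 6.1
and 6.2, p. 26). General mixture models have a convex natural parameter
space, too, since the set of positive definite matrices is convex. However,
there is no reason that the parameter space should be convex in general.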

Lemma 3.1 can then be generalized at the price of assuming $\Theta$ to be
compact, and included in an open set $\Theta^O$ such that $H^{M'}_{\gamma,\Theta^O,r}$ holds. This
is no difficulty for the mixture models we consider, under the same lower
bound constraints as before (since $\Theta^O$ itself can be chosen to be included
in a compact subset of the set of possible parameters). The entropy is then
increased by a multiplying factor $Q$, which only depends on $\Theta$ and roughly
measures its "nonconvexity". Since the exponential behavior of the entropy
with respect to $\varepsilon$ is of concern, this does not make the result really weaker.
Lemma 3.2 (Bracketing Entropy, Compact Case). Let $r \in \mathbb{N}^*$, $D \in \mathbb{N}^*$ and
$\Theta \subset \mathbb{R}^D$ assumed to be compact. Let $\Theta^O \subset \mathbb{R}^D$ open such that $\Theta \subset \Theta^O$ and
$\gamma : \Theta^O \times \mathbb{R}^d \to \mathbb{R}$. $\theta \in \Theta^O \mapsto \gamma(\theta; x)$ is assumed to be $C^1$ for $f^\wp d\lambda$-almost
all $x$. Assume that $H^{M'}_{\gamma,\Theta^O,r}$ holds.
Then

$$\exists Q \in \mathbb{N}^*,\ \forall \tilde\Theta \subset \Theta,\ \forall \varepsilon > 0, \quad N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \tilde\Theta\}, \|\cdot\|_r\big) \le Q \Big( \frac{\|M'\|_r \operatorname{diam} \tilde\Theta}{\varepsilon} \Big)^D \vee 1.$$

$Q$ is a constant which depends on the geometry of $\Theta$ ($Q = 1$ if $\Theta$ is convex).

This lemma is proved by applying Lemma 3.1 since $\Theta$ is still locally
convex. Since it is compact, it can be covered with a finite number $Q$ of
open balls, which are convex. Lemma 3.1 then applies to the convex hull of
the intersection of $\Theta$ with each one of them. The supremum of $M'$ is taken
over $\Theta^O$ (instead of $\Theta$) to make sure that the assumptions of Lemma
3.1 are fulfilled over those entire balls, which may not be included in $\Theta$.

The result we need for Section 4 is Lemma 3.3, obtained from Lemma
3.1 by a slight modification. Since it is applied locally there, the convexity
assumption is no problem. A supplementary and strong assumption $H^{M}_{\gamma,\Theta,\infty}$
is made. This is not fulfilled in the general Gaussian mixtures framework.
A sufficient condition is that the support of $f^\wp$ is bounded. This is false
of course for most usual distributions we may have in mind, but this is a
reasonable modeling assumption: most modeled phenomena are bounded.
Another sufficient condition to guarantee this assumption is that the contrast
is bounded from above. This is actually not the case of the contrast $-\log L_{cc}$,
but this can be imposed: replace $-\log L_{cc}$ by $(-\log L_{cc} \wedge C)$ and, provided
that $C$ is large enough, this new contrast behaves like $-\log L_{cc}$. Choosing a
relevant $C$ value is an additional difficulty in practice, though.

Lemma 3.3 (Bracketing Entropy, Convex Case). Let $r \ge 2$, $D \in \mathbb{N}^*$ and
$\Theta \subset \mathbb{R}^D$ assumed to be convex. Let $\Theta^O \subset \mathbb{R}^D$ open such that $\Theta \subset \Theta^O$ and
$\gamma : \Theta^O \times \mathbb{R}^d \to \mathbb{R}$. $\theta \in \Theta^O \mapsto \gamma(\theta; x)$ is assumed to be $C^1$ for $f^\wp d\lambda$-almost
all $x$. Assume that $H^{M}_{\gamma,\Theta,\infty}$ and $H^{M'}_{\gamma,\Theta,2}$ hold. Then

$$\forall \tilde\Theta \subset \Theta,\ \forall \varepsilon > 0, \quad N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \tilde\Theta\}, \|\cdot\|_r\big) \le \big( \dots\, \|M'\|_2 \operatorname{diam} \tilde\Theta \,\dots \big)^D \vee 1.$$



Let us remark that those results are quite general. We are interested
here in their application to the conditional classification likelihood, but they
hold all the same in the standard likelihood framework. Maugis and Michel
(2011) already provide bracketing entropy results in this framework. Our
results cannot be directly compared to theirs since they consider the Hellinger
distance. The dependency they get on the parameter space bounds and the
variable space dimension $d$ is explicit. This is helpful to derive an oracle
inequality. But they could not derive a local control of the entropy, hence
an unpleasant logarithm term in the expression of the optimal penalty they
get. Their results also suggest the necessity of assuming the contrast to be
bounded: see the discussion after Theorem 4.2. The results we propose
achieve the same rate with respect to $\varepsilon$. They depend on more opaque
quantities ($\|M\|_\infty$ and $\|M'\|_2$). This notably implies, from this first step already,
the assumption that the contrast is bounded over the true distribution
support. However, it could be expected to control those quantities with
respect to the parameter space bounds. Moreover, besides their simplicity,
they make it straightforward to derive a local control of the entropy.



4 Model Selection

As illustrated by Example 2.1, model selection is a crucial step. The number
of classes may even be the target of the study. Anyhow, a relevant number
of classes must obviously be chosen so as to design a good classification.

Model selection procedures introduced here are penalized conditional
classification likelihood criteria:

$$\mathrm{crit}(K) = -\log L_{cc}\big(\hat\theta_K^{MLccE}\big) + \mathrm{pen}(K).$$

Most results are stated for a general contrast $\gamma$ and any family of models
$\{\mathcal{M}_K\}_{1 \le K \le K_M}$ and then applied to $-\log L_{cc}$ and the Gaussian mixtures
family of models $\{\mathcal{M}_K\}_{1 \le K \le K_M}$ introduced in Section 1.1.

In Section 4.1, the consistency of such a model selection procedure
("identification" point of view) is proved for a class of penalties. Sufficient
conditions are given in the general Theorem 4.1, which is applied to our
framework in Theorem 4.2. The heaviest condition of Theorem 4.1, (B4),
may be guaranteed under regularity and (weak) identifiability assumptions,
and is discussed in Section 4.2. Our approach is inspired by the work of
Massart (2007) and is the first step toward non-asymptotic results.



4.1 Consistent Penalized Criteria 

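Assume that $K_0$ exists such that

$$\forall K < K_0,\quad \inf_{\theta \in \Theta_{K_0}} \mathbb{E}_{f^{\wp}}[\gamma(\theta)] < \inf_{\theta \in \Theta_K} \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$$
and
$$\forall K \ge K_0,\quad \inf_{\theta \in \Theta_{K_0}} \mathbb{E}_{f^{\wp}}[\gamma(\theta)] \le \inf_{\theta \in \Theta_K} \mathbb{E}_{f^{\wp}}[\gamma(\theta)],$$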

which means that the bias of the models is stationary from the model
$\mathcal{M}_{K_0}$ on: it is the "best" model. Remark that the last property should
hold mostly in the mixtures framework, and notably if the models were not
constrained, and then were nested. Under this assumption, a model selection
procedure is expected to asymptotically recover $K_0$, i.e. to be consistent.
This is an identification aim (see McQuarrie and Tsai, 1998, Chapter 1). It
would be disastrous to select a model which does not (almost) minimize the
bias. It is moreover assumed that the model $\mathcal{M}_{K_0}$ contains all the
interesting information (typically, the structure of the classes).

Let us stress that the "true" number of components of $f^{\wp}$ is not directly
of concern: in particular it is not assumed to equal $K_0$, and it is not even
assumed to be defined ($f^{\wp}$ does not have to be a Gaussian mixture). $K_0$ is
the best choice from the particular point of view introduced by using the
$-\log L_{cc}$ contrast, which is neither density estimation nor identification
of the "true" number of components.

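Theorem 4.1. Let $\{\Theta_K\}_{1 \le K \le K_M}$ be a collection of models with
$\Theta_K \subset \mathbb{R}^{D_K}$ ($D_1 \le \dots \le D_{K_M}$) and let $\theta_K^0 \in \Theta_K^0$, with
$\Theta_K^0 = \operatorname{argmin}_{\theta \in \Theta_K} \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$. Assume

$$K_0 = \min \operatorname*{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}\big[\gamma(\theta_K^0)\big] \qquad \text{(B1)}$$

$$\forall K,\ \hat\theta_K \in \Theta_K \text{ defined such that } \gamma_n(\hat\theta_K) \le \gamma_n(\theta_K^0) + o_{\mathbb{P}}(1) \text{ fulfills } \gamma_n(\hat\theta_K) \xrightarrow{\ \mathbb{P}\ } \mathbb{E}_{f^{\wp}}\big[\gamma(\theta_K^0)\big] \qquad \text{(B2)}$$

$$\forall K,\quad \mathrm{pen}(K) > 0 \text{ and } \mathrm{pen}(K) = o_{\mathbb{P}}(1) \text{ when } n \to +\infty; \qquad n\big(\mathrm{pen}(K) - \mathrm{pen}(K')\big) \xrightarrow[n \to +\infty]{\mathbb{P}} \infty \text{ when } K > K' \qquad \text{(B3)}$$

$$n\big(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1) \text{ for any } K \in \operatorname*{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}\big[\gamma(\theta_K^0)\big]. \qquad \text{(B4)}$$

Define $\hat K$ such that

$$\hat K = \min \operatorname*{argmin}_{1 \le K \le K_M} \Big\{ \underbrace{\gamma_n(\hat\theta_K) + \mathrm{pen}(K)}_{\mathrm{crit}(K)} \Big\}.$$

Then $\mathbb{P}[\hat K \ne K_0] \xrightarrow[n \to \infty]{} 0$.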



Sketch of proof. First prove that $\hat K$ cannot asymptotically "underestimate"
$K_0$. Suppose $\mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] > \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)]$. From (B2),
$(\gamma_n(\hat\theta_K) - \gamma_n(\hat\theta_{K_0}))$ is asymptotically of order
$\mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)] > 0$. Since the penalty
is $o_{\mathbb{P}}(1)$ from (B3), $\mathrm{crit}(K_0) < \mathrm{crit}(K)$ asymptotically and $\hat K \ne K$.

That $\hat K$ does not asymptotically "overestimate" $K_0$ involves the heaviest
assumption (B4). It is more involved since then
$(\mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)])$ is zero. The fluctuations of
$(\gamma_n(\hat\theta_K) - \gamma_n(\hat\theta_{K_0}))$ around zero then have to be evaluated and
canceled by a penalty large enough. According to (B4), a penalty larger
than $\frac{1}{n}$ should suffice. (B3) guarantees this condition. □
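The selection rule of Theorem 4.1, $\hat K = \min \operatorname{argmin}_K \{\gamma_n(\hat\theta_K) + \mathrm{pen}(K)\}$, can be sketched as follows. This is a minimal illustration, not an implementation from the paper: the function name and the dictionary-based interface are hypothetical, and the contrast values are assumed to have been computed beforehand for each candidate $K$.

```python
import math

def select_model(contrasts, penalties):
    """Return K_hat = min argmin_K {gamma_n(theta_hat_K) + pen(K)}.

    contrasts, penalties: dicts mapping each candidate K to the minimized
    empirical contrast and to the penalty value (hypothetical interface).
    """
    crit = {k: contrasts[k] + penalties[k] for k in contrasts}
    best = min(crit.values())
    # In case of ties, the smallest minimizer is selected, as in the theorem.
    return min(k for k, value in crit.items() if value == best)

# Illustration with made-up contrast values and a penalty of order log(n)/n,
# taking D_K = K as a placeholder for the model dimension.
n = 100
contrasts = {1: 2.0, 2: 1.0, 3: 0.99}
penalties = {k: k * math.log(n) / (2 * n) for k in contrasts}
print(select_model(contrasts, penalties))
```

Taking the smallest minimizer matters only when several models reach the same criterion value; it mirrors the `min argmin` in the definition of $\hat K$.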

Assumption (B3) defines the range of possible penalties; Assumption 
(B2) is guaranteed under assumption (A3) of Theorem 3.1: 

Lemma 4.1. For a fixed K , assume (A3). Then (B2) holds. 

Indeed, asymptotically, minimizing $\theta \mapsto \gamma_n(\theta)$ cannot differ much from
minimizing $\theta \mapsto \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$ if they are uniformly close to each other (A3).

Assumption (B4) is the heaviest assumption. Section 4.2 is devoted to
deriving sufficient conditions so that it holds. This will justify the
following theorem.

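Theorem 4.2. Let $(\mathcal{M}_K)_{1 \le K \le K_M}$ be Gaussian mixture models with compact
parameter spaces $\Theta_K$ and $\Theta_K^0 = \operatorname{argmin}_{\theta \in \Theta_K} \mathbb{E}_{f^{\wp}}[-\log L_{cc}(\theta)]$ for any $K$. Let
$K_0 = \min \operatorname{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}[-\log L_{cc}(\Theta_K^0)]$. Assume
$\forall K, \forall \theta \in \Theta_K, \forall \theta_{K_0}^0 \in \Theta_{K_0}^0$,

$$\mathbb{E}_{f^{\wp}}[-\log L_{cc}(\theta)] = \mathbb{E}_{f^{\wp}}\big[-\log L_{cc}(\theta_{K_0}^0)\big] \ \Longrightarrow\ -\log L_{cc}(\theta; x) = -\log L_{cc}(\theta_{K_0}^0; x) \quad f^{\wp}d\lambda\text{-a.e.} \qquad \text{(C1)}$$

For any $K$, let $\Theta_K^{O} \subset \mathbb{R}^{D_K}$ be an open set over which $\log L_{cc}$ is defined, such that
$\Theta_K \subset \Theta_K^{O}$; assume that the assumptions on $M$ and $M'$ hold for $\log L_{cc}$ over
$\Theta_K^{O}$ (finiteness of $\|M\|_\infty$ and of $\|M'\|_2$) and that, $\forall \theta_K^0 \in \Theta_K^0$,

$$I_{\theta_K^0} = \frac{\partial^2}{\partial\theta^2}\Big(\mathbb{E}_{f^{\wp}}[-\log L_{cc}(\theta)]\Big)\Big|_{\theta_K^0}$$

is nonsingular; let $\hat\theta_K^{MLccE} \in \Theta_K$ with

$$-\log L_{cc}\big(\hat\theta_K^{MLccE}\big) \le -\log L_{cc}(\theta_K^0) + o_{\mathbb{P}}(n).$$

Let $\mathrm{pen} : \{1, \dots, K_M\} \to \mathbb{R}^+$ (which may depend on $n$, $(\Theta_K)_{1 \le K \le K_M}$ and
the data) such that, $\forall K \in \{1, \dots, K_M\}$,

$$\mathrm{pen}(K) > 0 \text{ and } \mathrm{pen}(K) = o_{\mathbb{P}}(n) \text{ when } n \to +\infty; \qquad \big(\mathrm{pen}(K) - \mathrm{pen}(K')\big) \xrightarrow[n \to +\infty]{\mathbb{P}} \infty \text{ for any } K' < K.$$

Select $\hat K$ such that

$$\hat K = \min \operatorname*{argmin}_{1 \le K \le K_M} \Big\{ -\log L_{cc}\big(\hat\theta_K^{MLccE}\big) + \mathrm{pen}(K) \Big\}.$$

Then $\mathbb{P}[\hat K \ne K_0] \xrightarrow[n \to \infty]{} 0$.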

If $\Theta_K$ is convex, $M$ and $M'$ can be defined as suprema over $\Theta_K$ instead
of $\Theta_K^{O}$ and there is no need to introduce the sets $\Theta_K^{O}$. The new
"identifiability" assumption (C1) introduced is reasonable: as expected, the
label switching phenomenon is no problem here. But it is necessary, for the
identification point of view to make sense, that a single value of the contrast
function $x \mapsto \gamma(\theta; x)$ minimizes the loss. Remark that in the standard
likelihood framework, this holds at least if any model contains the sample
distribution, since it is the unique Kullback-Leibler divergence minimizer.
Obviously, several parameter values, perhaps in different models, may
represent it, besides the label switching. We do not know any such result with
the $-\log L_{cc}$ contrast and hypothesize that the assumption holds.

The assumption about the nonsingularity of $I_{\theta_K^0}$ is unpleasant, since it
is hard to guarantee. Hopefully, it could be weakened. The result of
Massart (2007) (Theorem 7.11) which inspires this one, and is available in a
standard likelihood context, does not require such an assumption since it
does not rely on the study of this link between the contrast and the
parameters but on a clever choice of the involved distances (Hellinger distances),
and on particular properties of the log function. However, this is a usual
assumption (see Redner and Walker, 1984, or below).

Moreover, Massart (2007) does not require the contrast (i.e. the
likelihood) to be bounded, as we have to. Remark however that the application
of his Lemma 7.23 to obtain a genuine oracle inequality involves an
assumption similar to the boundedness of the contrast. It thus seems reasonable
that the assumptions about $M$ and $M'$ (the latter being much milder than the
former) be necessary. They are typically ensured if either the contrast is
bounded or the support of $f^{\wp}$ is bounded.

The conditions about the penalty form are analogous to those of Nishii
(1988) or Keribin (2000), which are both derived in the standard maximum
likelihood framework. Like those of Keribin (2000), they can be regarded as
generalizing those of Nishii (1988) when the considered models are Gaussian
mixture models. Indeed, Nishii (1988) considers penalties of the form $c_n D_K$
and proves the model selection procedure to be weakly consistent if
$\frac{c_n}{n} \to 0$ and $c_n \to \infty$. Note that Nishii (1988) assumes the parameter space to be
convex. He moreover notably assumes that $\Theta_K^0 = \{\theta_K^0\}$ and that the
counterpart of $I_{\theta^0}$ is nonsingular, together with other regularity assumptions.
Those results are not particularly designed for mixture models. Instead, as
we do, Keribin (2000) considers general penalty forms and proves the
procedure to be consistent if $\frac{\mathrm{pen}(K)}{n} \to 0$, $\mathrm{pen}(K) \to \infty$ and
$\liminf \frac{\mathrm{pen}(K)}{\mathrm{pen}(K')} > 1$ if $K > K'$. These conditions are equivalent to
Nishii's if $\mathrm{pen}(K) = c_n D_K$. In a general mixture model framework, she
assumes the model family to be well-specified, the same notion of
identifiability as we do, and a condition which does not seem to be directly
comparable to ours about $I_{\theta^0}$ but which is of roughly the same flavor. It might
be milder. Those assumptions are proved to hold with the standard likelihood
contrast for Gaussian mixture models with lower-bounded, spherical covariance
matrices which are the same for all components, and if the means belong to a
compact set. Our conditions about the penalty are a little weaker than
Keribin's, but they still are quite analogous. Moreover, as compared to those
results, we notably have to keep the proportions away from zero. This is
necessary because the entropy term must be handled. It does not seem easy to
extend the methods used by Keribin (2000) to our framework.
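The two conditions quoted from Nishii (1988) for penalties of the form $c_n D_K$, namely $c_n / n \to 0$ and $c_n \to \infty$, can be illustrated numerically. The sketch below compares the BIC-type rate $c_n = \frac{\log n}{2}$ with the AIC-type rate $c_n = 1$; the function names are hypothetical and the printout only illustrates the trends, it proves nothing.

```python
import math

def bic_rate(n):
    """BIC-type rate c_n = log(n) / 2: c_n / n -> 0 and c_n -> infinity."""
    return 0.5 * math.log(n)

def aic_rate(n):
    """AIC-type rate c_n = 1: c_n / n -> 0 but c_n stays bounded."""
    return 1.0

for n in (10**2, 10**4, 10**6, 10**8):
    # Columns: n, BIC-type c_n, BIC-type c_n / n, AIC-type c_n
    print(n, bic_rate(n), bic_rate(n) / n, aic_rate(n))
```

The BIC-type rate grows without bound while remaining negligible with respect to $n$, whereas the AIC-type rate satisfies only the first of the two conditions.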

The strong version of Theorem 4.2, which would state the almost sure
convergence of $\hat K$ to $K_0$, would then probably involve slightly heavier
penalties, as Nishii (1988) and Keribin (2000) proved in their respective
frameworks: both had to assume that $\frac{\mathrm{pen}(K)}{\log \log n} \to \infty$.

Theorem 4.2 is a direct consequence of Theorem 4.1, Lemma 4.1, Theo- 
rem 3.2, which can be applied under those assumptions, and of Corollary 4.2 
below and the discussion about its assumptions along the lines of Section 4.2. 

4.2 Sufficient Conditions to Ensure Assumption (B4) 

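Let us introduce the notation $S_n\gamma(\theta) = n\big(\gamma_n(\theta) - \mathbb{E}_{f^{\wp}}[\gamma(\theta)]\big)$. The main
result of this section is Lemma 4.2. Some intermediate results which make it
possible to link Lemma 4.2 to Theorem 4.1 via Assumption (B4) are stated
as corollaries and proved subsequently. Lemma 4.2 provides a control of
$\sup_{\theta \in \Theta} \frac{S_n(\gamma(\theta^0) - \gamma(\theta))}{\|\theta^0 - \theta\|_\infty^2 + \beta^2}$ (with respect to $\beta$) and then of
$\frac{S_n(\gamma(\theta^0) - \gamma(\hat\theta))}{\|\theta^0 - \hat\theta\|_\infty^2 + \beta^2}$. With
a good choice of $\beta$, and if $S_n(\gamma(\theta^0) - \gamma(\hat\theta))$ can be linked to $\|\hat\theta - \theta^0\|_\infty^2$, it
is proved in Corollary 4.1 that it may then be assessed that
$n\|\hat\theta - \theta^0\|_\infty^2 = O_{\mathbb{P}}(1)$.

Plugging this last property back into the result of Lemma 4.2 yields
(Corollary 4.2) $n\big(\gamma_n(\theta_K^0) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1)$ for any model
$K \in \operatorname{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)]$ and then, under a mild identifiability condition,
$n\big(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1)$, which is Assumption (B4).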

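Lemma 4.2. Let $D \in \mathbb{N}^*$ and $\Theta \subset \mathbb{R}^D$ convex. Let $\Theta^{O} \subset \mathbb{R}^D$ open such that
$\Theta \subset \Theta^{O}$ and $\gamma : \Theta^{O} \times \mathbb{R}^d \to \mathbb{R}$. $\theta \in \Theta^{O} \mapsto \gamma(\theta; x)$ is assumed to be $C^1$ over
$\Theta^{O}$ for $f^{\wp}d\lambda$-almost all $x$. Let $\theta^0 \in \Theta$ such that
$\mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] = \inf_{\theta \in \Theta} \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$.
Assume that the assumptions on $M$ and $M'$ hold over $\Theta$ (finiteness of $\|M\|_\infty$ and of $\|M'\|_2$).

Then $\exists \alpha > 0 \,/\, \forall n, \forall \beta > 0, \forall \eta > 0$, with probability larger than $(1 - e^{-\eta})$,

$$\sup_{\theta \in \Theta} \frac{S_n\big(\gamma(\theta^0) - \gamma(\theta)\big)}{\|\theta^0 - \theta\|_\infty^2 + \beta^2} \le \frac{\alpha}{\beta^2}\Big( \|M'\|_2\,\beta\sqrt{nD} + \big(\|M\|_\infty + \|M'\|_2\,\beta\big) D + \|M'\|_2\,\sqrt{n\eta}\,\beta + \|M\|_\infty\,\eta \Big).$$

Note that $\alpha$ is an absolute constant which notably does not depend on $\theta^0$.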

Sketch of proof. The proof relies on results of Massart (2007) and on the
evaluation of the bracketing entropy of the class of functions at hand. Lemma
3.3 provides a local control of the entropy and hence, through Theorem
6.8 in Massart (2007), a control of the supremum of $S_n(\gamma(\theta^0) - \gamma(\theta))$ as
$\|\theta - \theta^0\|_\infty < \sigma$, with respect to $\sigma$. The "peeling" Lemma 4.23 in Massart
(2007) then makes it possible to take advantage of this local control to derive a fine
global control of $\sup_{\theta \in \Theta} \frac{S_n(\gamma(\theta^0) - \gamma(\theta))}{\|\theta^0 - \theta\|_\infty^2 + \beta^2}$, for any $\beta$. This control in
expectation, which can be derived conditionally to any event $A$, yields a control in
probability thanks to Lemma 2.4 in Massart (2007), which can be thought
of as an application of Markov's inequality. □

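Corollary 4.1. Same assumptions as Lemma 4.2, except for the convexity of $\Theta$.
Besides, assume that $I_{\theta^0} = \frac{\partial^2}{\partial\theta^2}\big(\mathbb{E}_{f^{\wp}}[\gamma(\theta)]\big)\big|_{\theta^0}$ is nonsingular. Let $(\hat\theta_n)_{n \ge 1}$
such that $\hat\theta_n \in \Theta$, $\gamma_n(\hat\theta_n) \le \gamma_n(\theta^0) + O_{\mathbb{P}}\big(\frac{1}{n}\big)$ and $\hat\theta_n \xrightarrow[n \to \infty]{\mathbb{P}} \theta^0$. Then

$$n\|\hat\theta_n - \theta^0\|_\infty^2 = O_{\mathbb{P}}(1).$$

The constant involved in $O_{\mathbb{P}}(1)$ depends on $D$, $\|M\|_\infty$, $\|M'\|_2$ and $I_{\theta^0}$.

This is a direct consequence of Lemma 4.2: it suffices to choose $\beta$ well.
The dependency of $O_{\mathbb{P}}(1)$ on $D$, $\|M\|_\infty$, $\|M'\|_2$ and $I_{\theta^0}$ is not a problem
since we aim at deriving an asymptotic result: what is of concern is the order
of $\|\hat\theta - \theta^0\|_\infty$ with respect to $n$ when the model is fixed.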

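The assumption that $I_{\theta^0}$ is nonsingular plays a role analogous to that of
Assumption (A2) in Theorem 3.1: it ensures that $\mathbb{E}_{f^{\wp}}[\gamma(\theta)]$ cannot be close
to $\mathbb{E}_{f^{\wp}}[\gamma(\theta^0)]$ if $\theta$ is not close to $\theta^0$. But this stronger assumption is
necessary to strengthen the conclusion: the rate of the relation between
$\mathbb{E}_{f^{\wp}}[\gamma(\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)]$ and $\|\theta - \theta^0\|$ can then be controlled.

Should this assumption fail,

$$\exists \tilde\theta \in \Theta \,/\, \tilde\theta' I_{\theta^0} \tilde\theta = 0 \ \Longrightarrow\ \mathbb{E}_{f^{\wp}}\big[\gamma(\theta^0 + \lambda\tilde\theta)\big] = \mathbb{E}_{f^{\wp}}\big[\gamma(\theta^0)\big] + o(\lambda^2)$$

and then there is no hope to have $\alpha > 0$ such that
$\mathbb{E}_{f^{\wp}}[\gamma(\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] > \alpha\|\theta - \theta^0\|^2$: this approach cannot be applied
without this admittedly unpleasant assumption. Perhaps another approach (with
distances not involving the parameters but directly the contrast values) might
make it possible to avoid it, as Massart (2007) did in the likelihood framework.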

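Corollary 4.2. Let $(\Theta_K)_{1 \le K \le K_M}$ be models with, for any $K$, $\Theta_K \subset \mathbb{R}^{D_K}$.
Assume that $D_1 \le \dots \le D_{K_M}$. For any $K$, assume there exists an open set
$\Theta_K^{O} \subset \mathbb{R}^{D_K}$ such that $\Theta_K \subset \Theta_K^{O}$ and such that, with $\Theta^{O} = \Theta_1^{O} \cup \dots \cup \Theta_{K_M}^{O}$,
$\gamma : \Theta^{O} \times \mathbb{R}^d \to \mathbb{R}$ is defined and $C^1$ for $f^{\wp}d\lambda$-almost all $x$. Assume that
the assumptions on $M$ and $M'$ hold over $\Theta^{O}$ (finiteness of $\|M\|_\infty$ and of $\|M'\|_2$).
Let, for any $K$, $\Theta_K^0 = \operatorname{argmin}_{\theta \in \Theta_K} \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$ and $\theta_K^0 \in \Theta_K^0$.

Let $K_0 = \min \operatorname{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)]$ and assume $\forall K, \forall \theta \in \Theta_K$,

$$\mathbb{E}_{f^{\wp}}[\gamma(\theta)] = \mathbb{E}_{f^{\wp}}\big[\gamma(\theta_{K_0}^0)\big] \ \Longrightarrow\ \gamma(\theta) = \gamma(\theta_{K_0}^0) \quad f^{\wp}d\lambda\text{-a.e.}$$

Let $\mathcal{K} = \big\{K \in \{1, \dots, K_M\} : \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] = \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)]\big\}$.
For any $K \in \mathcal{K}$, let $\hat\theta_K \in \Theta_K$ such that

$$\gamma_n(\hat\theta_K) \le \gamma_n(\theta_K^0) + O_{\mathbb{P}}\Big(\frac{1}{n}\Big) \quad\text{and}\quad \hat\theta_K \xrightarrow[n \to \infty]{\mathbb{P}} \theta_K^0.$$

Assume that $I_{\theta_K^0} = \frac{\partial^2}{\partial\theta^2}\big(\mathbb{E}_{f^{\wp}}[\gamma(\theta)]\big)\big|_{\theta_K^0}$ is nonsingular for any $K \in \mathcal{K}$.

Then $\forall K \in \mathcal{K}$, $n\big(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1)$.

This last corollary states conditions under which assumption (B4) of
Theorem 4.1 is ensured.

4.3 A New Light on ICL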



The previous section suggests links between penalized model selection
criteria with the standard likelihood on the one hand and with the conditional
classification likelihood we defined on the other hand. Indeed, penalties with
the same form as those given by Nishii (1988) or Keribin (2000) with the
standard likelihood are proved to be "consistent" in our framework.
Therefore, by analogy with the standard likelihood framework, it is expected that
penalties proportional to $D_K$ suit an efficiency point of view (think of
AIC), and that penalties proportional to $D_K \log n$ are optimal for an
identification purpose (think of BIC). This possibility of deriving an identification
procedure from an efficient procedure through a $\log n$ factor is noted for example
by Arlot (2007).

Let us then consider, by analogy with BIC, the penalized criterion

$$\mathrm{crit}_{L_{cc}\text{-ICL}}(K) = \log L_{cc}\big(\hat\theta_K^{MLccE}\big) - \frac{\log n}{2} D_K.$$

The point is that we almost recover ICL (replace $\hat\theta_K^{MLE}$ by $\hat\theta_K^{MLccE}$ in (2)),
which may then be regarded as an approximation of this $L_{cc}$-ICL criterion.
The corresponding penalty is $\frac{\log n}{2} D_K$, and the derivation of $L_{cc}$-ICL
illustrates that the entropy should not be considered as a part of the penalty.
This notably explains why ICL does not select the same number of
components as BIC or any consistent criterion in the standard likelihood
framework, even asymptotically. Actually, it should not be expected to do so.
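The criterion above can be sketched numerically for a univariate Gaussian mixture. The sketch assumes the decomposition $\log L_{cc} = \log L - \mathrm{ENT}$ used throughout the paper, written here as $\sum_i \sum_k \tau_{ik} \log(\pi_k \phi(x_i; \mu_k, \sigma_k))$; the function names are hypothetical, the parameters are taken as given (no estimation step is shown), and the dimension `dim` is supplied by the caller.

```python
import math

def _joint_terms(x, weights, means, sds):
    # pi_k * phi(x; mu_k, sd_k) for each component k
    return [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, sds)
    ]

def log_lik(data, weights, means, sds):
    """Standard log-likelihood log L(theta)."""
    return sum(math.log(sum(_joint_terms(x, weights, means, sds))) for x in data)

def log_lcc(data, weights, means, sds):
    """Conditional classification log-likelihood, assumed equal to log L - ENT."""
    total = 0.0
    for x in data:
        joint = _joint_terms(x, weights, means, sds)
        density = sum(joint)
        total += sum((j / density) * math.log(j) for j in joint if j > 0.0)
    return total

def lcc_icl(data, weights, means, sds, dim):
    """Penalized criterion log L_cc(theta) - (log n / 2) * dim (to be maximized)."""
    return log_lcc(data, weights, means, sds) - 0.5 * math.log(len(data)) * dim

data = [-2.1, -1.9, 2.0, 2.2]
# Entropy term ENT = log L - log L_cc: close to 0 for well separated components.
print(log_lik(data, [0.5, 0.5], [-2.0, 2.1], [0.5, 0.5])
      - log_lcc(data, [0.5, 0.5], [-2.0, 2.1], [0.5, 0.5]))
print(lcc_icl(data, [0.5, 0.5], [-2.0, 2.1], [0.5, 0.5], dim=5))
```

The difference `log_lik - log_lcc` is the entropy term: it vanishes for a single component and grows when components overlap, which is how the contrast itself, rather than the penalty, favors well separated classes.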

When $\hat\theta_K^{MLccE}$ differs from $\hat\theta_K^{MLE}$, the former provides more separated
clusters. The compromise between the Gaussian component and the cluster
viewpoints is achieved with $\hat\theta_K^{MLccE}$ from the very estimation step. The user is
provided with a solution which aims at this compromise for each number of classes
$K$. However, the number of classes selected through $L_{cc}$-ICL seldom differs
from the one selected by ICL in simulations (see Baudry, 2009, Chapter 4).

Finally, $L_{cc}$-ICL is quite close to ICL and helps to better understand
the concepts underlying ICL. ICL remains attractive though, notably
because it is easier to implement than $L_{cc}$-ICL.



5 Discussion 



Two families of criteria, in the clustering framework, are distinguished: it
is shown that ICL's purpose is of a different nature than that of BIC or AIC.
The identification theory for the criteria based on the conditional
classification likelihood is, not surprisingly, very similar to the one for the
standard likelihood. A major interest of the newly introduced estimator
and criteria is to better understand the ICL criterion and the underlying
notion of class. This is neither a simple notion of cluster (as for example for
the k-means procedure) nor a pure notion of "component" (as underlying
the MLE/BIC approach), but a compromise between both. ICL
leads to discovering classes matching a subtle combination of the notions
of well separated, compact clusters and of (Gaussian) mixture components.
It then enjoys the flexibility and modeling possibilities of the model-based
clustering approach, but does not break the expected notion of cluster.
Better understanding of the ICL criterion now means better understanding the
newly involved contrast $L_{cc}$.

The choice of the involved mixture components must be handled with
care in this framework since it drives the cluster shape underlying the study.
Several forms of Gaussian mixtures may be involved: for example, spherical
and general models may be compared, or models with free proportions may
be compared with models with equal proportions.

Besides, it should be further studied how the complexity of the models
should be measured when several kinds of models are compared. The dimension
of the model as a parametric space works for the reported theoretical results.
But we are not completely convinced that it is the finest measure of the
complexity of Gaussian mixture models. As a matter of fact, this simple
parametric point of view amounts to considering that all parameters play
an analogous role. This is not really natural.

A further theoretical step would be to derive non-asymptotic results and
oracle inequalities. This may give more precise insights about the best
penalty shape to use, and then justify the use of the slope heuristics of
Birgé and Massart (2007) (see also Baudry et al., 2011 or Baudry, 2009 for
simulations and discussions on this topic).

A practical challenge is to provide efficient optimization algorithms.
Some work has been done in this direction already: see Baudry et al. (2008)
and Baudry (2009, Section 5.1). But they need to be improved to be more
reliable, and above all to run much faster, which would obviously be a condition
for a widespread practical use of the new contrast.

A possibility to make this contrast more flexible would be to assign
different weights to the log likelihood and the entropy:
$\log L_{cc_\alpha} = \alpha \log L + (1 - \alpha)\,\mathrm{ENT}$, with $\alpha \in [0; 1]$. This would make it
possible to tune how important the assignment confidence is with respect to the
Gaussian fit. The difficulty would then be to choose $\alpha$. A first idea is to
calibrate $\alpha$ from simulations of situations in which the user knows what
solution is expected.






6 Proofs 

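Proof (Max of $h_K$, p. 7). If $h_K$ reaches a maximum value at $(t_1^0, \dots, t_K^0)$ under
the constraint $\sum_{k=1}^K t_k = 1$ then, with $S : (t_1, \dots, t_K) \mapsto \sum_{k=1}^K t_k$,

$$\exists \lambda \in \mathbb{R} \,/\, dh_K(t_1^0, \dots, t_K^0) = \lambda\, dS(t_1^0, \dots, t_K^0).$$

This is equivalent to $\forall k$, $\log t_k^0 + 1 = \lambda$. Then, $\forall k$, $t_k^0 = e^{\lambda - 1}$ and since
$\sum_{k=1}^K t_k^0 = 1$, this yields $t_k^0 = \frac{1}{K}$. □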
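The conclusion of this proof can be checked numerically. The sketch below assumes that $h_K$ is the entropy function $h_K(t) = -\sum_k t_k \log t_k$ on the simplex (its definition lies outside this section) and compares its value at the uniform point with its values at randomly drawn points; the function name `h` and the sampling scheme are illustrative choices.

```python
import math
import random

def h(t):
    """Entropy -sum_k t_k log t_k, with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in t if p > 0.0)

K = 4
uniform = [1.0 / K] * K

random.seed(0)
largest = 0.0
for _ in range(1000):
    raw = [random.random() for _ in range(K)]
    total = sum(raw)
    point = [v / total for v in raw]  # a random point of the simplex
    largest = max(largest, h(point))

# The value at the uniform point is log(K) and is not exceeded by the samples.
print(h(uniform), math.log(K), largest)
```

A random search of this kind is of course no proof; it only illustrates that the stationary point $t_k^0 = 1/K$ found through the Lagrange condition is indeed a maximum, with value $\log K$.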

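Proof of Theorem 3.1. Let $\varepsilon > 0$ and
$\eta = \inf_{d(\theta, \Theta^0) > \varepsilon} \mathbb{E}_{f^{\wp}}[\gamma(\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] > 0$
(from assumption (A2)). For $n$ large enough and with large probability,
from assumption (A3) and the definition of $\hat\theta$,

$$\sup_{\theta \in \Theta} \big|\gamma_n(\theta) - \mathbb{E}_{f^{\wp}}[\gamma(\theta)]\big| < \frac{\eta}{3} \quad\text{and}\quad \gamma_n(\hat\theta) < \gamma_n(\theta^0) + \frac{\eta}{3}.$$

Then

$$\mathbb{E}_{f^{\wp}}[\gamma(\hat\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] \le \big(\mathbb{E}_{f^{\wp}}[\gamma(\hat\theta)] - \gamma_n(\hat\theta)\big) + \big(\gamma_n(\hat\theta) - \gamma_n(\theta^0)\big) + \big(\gamma_n(\theta^0) - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)]\big) < \eta.$$

And $d(\hat\theta, \Theta^0) < \varepsilon$ with great probability, as $n$ is large enough. □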

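Proof of Lemma 3.1. Let $\varepsilon > 0$, and $\tilde\Theta \subset \Theta$, with $\tilde\Theta$ bounded. Let $\tilde\Theta_\varepsilon$ be
a grid in $\Theta$ which "$\varepsilon$-covers" $\tilde\Theta$ in any dimension with step $\varepsilon$. $\tilde\Theta_\varepsilon$ is for
example $\tilde\Theta_\varepsilon^1 \times \dots \times \tilde\Theta_\varepsilon^D$ with

$$\forall i \in \{1, \dots, D\},\quad \tilde\Theta_\varepsilon^i = \big\{\tilde\theta_{\min}^i,\ \tilde\theta_{\min}^i + \varepsilon,\ \dots,\ \tilde\theta_{\max}^i\big\},$$

where

$$\forall i \in \{1, \dots, D\},\quad \{\theta^i : \theta \in \tilde\Theta\} \subset \Big[\tilde\theta_{\min}^i - \frac{\varepsilon}{2},\ \tilde\theta_{\max}^i + \frac{\varepsilon}{2}\Big].$$

This is always possible since $\Theta$ is convex. For the sake of simplicity, it is
assumed without loss of generality that $\tilde\Theta_\varepsilon \subset \Theta$. With the $\|\cdot\|_\infty$ norm, the
step of the grid $\tilde\Theta_\varepsilon$ is the same as the step over each dimension, $\varepsilon$:

$$\forall \theta \in \tilde\Theta,\ \exists \theta_\varepsilon \in \tilde\Theta_\varepsilon \,/\, \|\theta - \theta_\varepsilon\|_\infty \le \frac{\varepsilon}{2}.$$

And the cardinal of $\tilde\Theta_\varepsilon$ is at most

$$\prod_{i=1}^{D} \left(\frac{\sup_{\theta \in \tilde\Theta} \theta^i - \inf_{\theta \in \tilde\Theta} \theta^i}{\varepsilon} \vee 1\right) \le \left(\frac{\mathrm{diam}\,\tilde\Theta}{\varepsilon}\right)^{D} \vee 1.$$

Now, let $\theta_1$ and $\theta_2$ in $\Theta$ and $x \in \mathbb{R}^d$. Then

$$|\gamma(\theta_1; x) - \gamma(\theta_2; x)| \le \sup_{\theta \in [\theta_1; \theta_2]} \left\|\frac{\partial\gamma}{\partial\theta}(\theta; x)\right\| \|\theta_1 - \theta_2\|_\infty \le \underbrace{\sup_{\theta \in \Theta} \left\|\frac{\partial\gamma}{\partial\theta}(\theta; x)\right\|}_{M'(x)} \|\theta_1 - \theta_2\|_\infty,$$

since $\Theta$ is convex. Let $\theta \in \tilde\Theta$ and choose $\theta_\varepsilon \in \tilde\Theta_\varepsilon$ such that
$\|\theta - \theta_\varepsilon\|_\infty \le \frac{\varepsilon}{2}$. Then

$$\forall x \in \mathbb{R}^d,\quad |\gamma(\theta_\varepsilon; x) - \gamma(\theta; x)| \le M'(x)\,\frac{\varepsilon}{2}$$

and

$$\gamma(\theta_\varepsilon; x) - \frac{\varepsilon}{2} M'(x) \le \gamma(\theta; x) \le \gamma(\theta_\varepsilon; x) + \frac{\varepsilon}{2} M'(x).$$

The set of $\varepsilon\|M'\|_r$-brackets (for the $\|\cdot\|_r$-norm)

$$\Big\{\Big[\gamma(\theta_\varepsilon) - \frac{\varepsilon}{2} M'\ ;\ \gamma(\theta_\varepsilon) + \frac{\varepsilon}{2} M'\Big] : \theta_\varepsilon \in \tilde\Theta_\varepsilon\Big\}$$

then has cardinal at most $\big(\frac{\mathrm{diam}\,\tilde\Theta}{\varepsilon}\big)^D \vee 1$ and covers $\{\gamma(\theta) : \theta \in \tilde\Theta\}$. □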
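The grid construction used in this proof can be sketched as follows. This is an illustration under simplifying assumptions, not the exact grid of the proof: the parameter set is taken to be a box given by per-coordinate bounds, the grid points are placed at the midpoints of cells of width `eps`, and the number of points per axis uses a ceiling, so the count is $\prod_i \lceil (\mathrm{hi}_i - \mathrm{lo}_i)/\varepsilon \rceil \vee 1$ rather than the expression of the proof. All function names are hypothetical.

```python
import itertools
import math

def axis_points(lo, hi, eps):
    """Midpoints of consecutive cells of width eps covering [lo, hi]."""
    count = max(1, math.ceil((hi - lo) / eps))
    return [lo + eps * (j + 0.5) for j in range(count)]

def sup_norm_grid(box, eps):
    """Product grid over a box given as a list of (lo, hi) pairs."""
    return list(itertools.product(*(axis_points(lo, hi, eps) for lo, hi in box)))

def nearest_sup_distance(point, grid):
    """Sup-norm distance from a point to the closest grid node."""
    return min(max(abs(p - g) for p, g in zip(point, node)) for node in grid)

box = [(0.0, 1.0), (0.0, 2.0)]
eps = 0.5
grid = sup_norm_grid(box, eps)
print(len(grid))  # number of grid nodes, hence of brackets

# Every point of the box lies within eps / 2 of some node in sup norm.
worst = max(
    nearest_sup_distance((0.1 * i, 0.1 * j), grid)
    for i in range(11) for j in range(21)
)
print(worst)
```

Each grid node $\theta_\varepsilon$ then gives one bracket $[\gamma(\theta_\varepsilon) - \frac{\varepsilon}{2}M' ; \gamma(\theta_\varepsilon) + \frac{\varepsilon}{2}M']$, so the number of nodes bounds the bracketing number.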

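Example 6.1 (Diagonal Gaussian Mixture Model Parameter Space is Convex).
Following Celeux and Govaert (1995), we write $[p\lambda_k B_k]$ for the model
of Gaussian mixtures with diagonal covariance matrices and equal mixing
proportions. To keep the notation simple, let us consider the case $d = 2$ and
$K = 2$ ($d = 1$ or $K = 1$ are obviously particular cases!). A natural
parametrization of this model (whose dimension is 8) is

$$\varphi : \theta \in \mathbb{R}^4 \times (\mathbb{R}^{+*})^4 \longmapsto \frac{1}{2}\,\phi\!\left(\cdot\,; \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix}, \begin{pmatrix}\theta_5 & 0\\ 0 & \theta_6\end{pmatrix}\right) + \frac{1}{2}\,\phi\!\left(\cdot\,; \begin{pmatrix}\theta_3\\ \theta_4\end{pmatrix}, \begin{pmatrix}\theta_7 & 0\\ 0 & \theta_8\end{pmatrix}\right),$$

where $\phi(\cdot\,; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance matrix $\Sigma$.

Then $[p\lambda_k B_k] = \varphi\big(\mathbb{R}^4 \times (\mathbb{R}^{+*})^4\big)$ and the parameter space $\mathbb{R}^4 \times (\mathbb{R}^{+*})^4$ is convex.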

Example 6.2 (The Same Model with Equal Volumes is Convex, too...).
$[p\lambda B_k]$ is the same model as in the previous example, but the determinants
of the covariance matrices have to be equal. With $d = 2$ and $K = 2$, a natural
parametrization of this model of dimension 7 is

$$\varphi_2 : \theta \in \mathbb{R}^4 \times (\mathbb{R}^{+*})^3 \longmapsto \dots$$

Then $[p\lambda B_k] = \varphi_2\big(\mathbb{R}^4 \times (\mathbb{R}^{+*})^3\big)$ and the parameter space $\mathbb{R}^4 \times (\mathbb{R}^{+*})^3$ is convex.

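Proof of Lemma 3.2. Let $O_1, \dots, O_Q$ be a finite covering of $\Theta$ consisting of
open balls such that $\cup_{q=1}^{Q} O_q \subset \Theta^{O}$. Such a covering always exists since $\Theta$
is assumed to be compact. Remark that

$$\Theta = \cup_{q=1}^{Q} (O_q \cap \Theta) \subset \cup_{q=1}^{Q}\, \mathrm{conv}(O_q \cap \Theta).$$

Now, for any $q$, $\mathrm{conv}(O_q \cap \Theta)$ is convex and

$$\sup_{\theta \in \mathrm{conv}(O_q \cap \Theta)} \left\|\frac{\partial\gamma}{\partial\theta}(\theta; x)\right\| \le M'(x)$$

since $\mathrm{conv}(O_q \cap \Theta) \subset O_q \subset \Theta^{O}$. Therefore, Lemma 3.1 applies to
$O_q \cap \Theta \subset \mathrm{conv}(O_q \cap \Theta)$:

$$N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \Theta \cap O_q\}, \|\cdot\|_r\big) \le \left(\frac{\|M'\|_r\ \mathrm{diam}\,\Theta}{\varepsilon}\right)^{D} \vee 1.$$

Since

$$N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \Theta\}, \|\cdot\|_r\big) \le N_{[\,]}\big(\varepsilon, \cup_{q=1}^{Q} \{\gamma(\theta) : \theta \in \Theta \cap O_q\}, \|\cdot\|_r\big) \le \sum_{q=1}^{Q} N_{[\,]}\big(\varepsilon, \{\gamma(\theta) : \theta \in \Theta \cap O_q\}, \|\cdot\|_r\big),$$

the result follows. □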



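Proof of Lemma 3.3. Consider the grid $\tilde\Theta_\varepsilon$ of the proof of Lemma 3.1. Let
$\theta_1$ and $\theta_2$ in $\Theta$ and $x \in \mathbb{R}^d$. Since $\Theta$ is convex,

$$|\gamma(\theta_1; x) - \gamma(\theta_2; x)|^r \le \sup_{\theta \in [\theta_1; \theta_2]} \left\|\frac{\partial\gamma}{\partial\theta}(\theta; x)\right\|^2 \|\theta_1 - \theta_2\|_\infty^2 \times \Big(2 \sup_{\theta \in \{\theta_1, \theta_2\}} |\gamma(\theta; x)|\Big)^{r-2} \le M'(x)^2\, \|\theta_1 - \theta_2\|_\infty^2\, (2\|M\|_\infty)^{r-2} \quad f^{\wp}d\lambda\text{-a.e.}$$

Let $\theta \in \tilde\Theta$ and choose $\theta_\varepsilon \in \tilde\Theta_\varepsilon$ such that $\|\theta - \theta_\varepsilon\|_\infty \le \frac{\varepsilon}{2}$. Then

$$|\gamma(\theta_\varepsilon; x) - \gamma(\theta; x)| \le M'(x)^{\frac{2}{r}} \Big(\frac{\varepsilon}{2}\Big)^{\frac{2}{r}} (2\|M\|_\infty)^{\frac{r-2}{r}} \quad f^{\wp}d\lambda\text{-a.e.}$$

and the set of brackets

$$\Big\{\Big[\gamma(\theta_\varepsilon; x) - \varepsilon^{\frac{2}{r}} M'(x)^{\frac{2}{r}} \|M\|_\infty^{\frac{r-2}{r}} 2^{1 - \frac{4}{r}}\ ;\ \gamma(\theta_\varepsilon; x) + \varepsilon^{\frac{2}{r}} M'(x)^{\frac{2}{r}} \|M\|_\infty^{\frac{r-2}{r}} 2^{1 - \frac{4}{r}}\Big] : \theta_\varepsilon \in \tilde\Theta_\varepsilon\Big\}$$

(of $\|\cdot\|_r$-norm length $2^{2 - \frac{4}{r}} \|M\|_\infty^{\frac{r-2}{r}} \|M'\|_2^{\frac{2}{r}} \varepsilon^{\frac{2}{r}}$) has cardinal at most
$\big(\frac{\mathrm{diam}\,\tilde\Theta}{\varepsilon}\big)^D \vee 1$ and covers $\{\gamma(\theta) : \theta \in \tilde\Theta\}$, which yields Lemma 3.3. □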

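Proof of Theorem 4.1. Let $\mathcal{K} = \operatorname{argmin}_{1 \le K \le K_M} \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)]$. By
assumption, $K_0 = \min \mathcal{K}$.

It is first proved that $\hat K$ does not asymptotically "underestimate" $K_0$.
Let $K \notin \mathcal{K}$. Let $\varepsilon = \frac{1}{2}\big(\mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)]\big) > 0$. From (B2) and
(B3) ($\mathrm{pen}(K) = o_{\mathbb{P}}(1)$), with large probability and for $n$ large enough:

$$\big|\gamma_n(\hat\theta_K) - \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)]\big| \le \frac{\varepsilon}{3}, \qquad \big|\gamma_n(\hat\theta_{K_0}) - \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)]\big| \le \frac{\varepsilon}{3}, \qquad \mathrm{pen}(K_0) \le \frac{\varepsilon}{3}.$$

Then

$$\mathrm{crit}(K) = \gamma_n(\hat\theta_K) + \mathrm{pen}(K) \ge \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] - \frac{\varepsilon}{3} + 0 = \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)] + \frac{5\varepsilon}{3} \ge \underbrace{\gamma_n(\hat\theta_{K_0}) + \mathrm{pen}(K_0)}_{\mathrm{crit}(K_0)} + \varepsilon.$$

Then, with large probability and for $n$ large enough, $\hat K \ne K$.

Let now $K \in \mathcal{K} \setminus \{K_0\}$ (hence $K > K_0$). This part of the result is more involved
than the first one but at this stage, it is not more difficult to derive: all the
difficulty is hidden in the strong assumption (B4). Indeed, it implies that
$\exists V > 0$ such that, for $n$ large enough and with large probability,

$$n\big(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)\big) \le V.$$

Increase $n$ enough so that $n(\mathrm{pen}(K) - \mathrm{pen}(K_0)) > V$ with large probability
(which is possible from assumption (B3)). Then, for $n$ large enough and
with large probability,

$$\mathrm{crit}(K) = \gamma_n(\hat\theta_K) + \mathrm{pen}(K) \ge \gamma_n(\hat\theta_{K_0}) - \frac{V}{n} + \mathrm{pen}(K) > \mathrm{crit}(K_0).$$

And then, with large probability and for $n$ large enough, $\hat K \ne K$.

Finally, since $\mathbb{P}[\hat K \ne K_0] = \sum_{K \notin \mathcal{K}} \mathbb{P}[\hat K = K] + \sum_{K \in \mathcal{K}, K \ne K_0} \mathbb{P}[\hat K = K]$,
the result follows. □

Proof of Lemma 4.1. For any $\varepsilon > 0$, with large probability and for $n$ large
enough:

$$\underbrace{\gamma_n(\hat\theta) - \mathbb{E}_{f^{\wp}}[\gamma(\hat\theta)]}_{\ge -\varepsilon} + \underbrace{\mathbb{E}_{f^{\wp}}[\gamma(\hat\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)]}_{\ge 0} = \gamma_n(\hat\theta) - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] = \underbrace{\gamma_n(\hat\theta) - \gamma_n(\theta^0)}_{\le \varepsilon} + \underbrace{\gamma_n(\theta^0) - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)]}_{\le \varepsilon}.$$ □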

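Proof of Lemma 4.2. Actually, the proof as it is written below holds for an
at most countable model (because this assumption is necessary for Lemma
4.23 and Theorem 6.8 in Massart (2007) to hold). But it can be checked
that both those results may be applied to a dense subset of
$\{\gamma(\theta) : \theta \in \Theta\}$ containing $\theta^0$ and their respective conclusions generalized to the entire
set: choose $\Theta_{count}$ a countable dense subset of $\Theta$. Then, for any $\theta \in \Theta$,
let $\theta_n \in \Theta_{count}$ with $\theta_n \xrightarrow[n \to \infty]{} \theta$. Then, $\gamma(\theta_n; X) \xrightarrow[n \to \infty]{} \gamma(\theta; X)$. Now,
whatever $g : \mathbb{R}^D \times (\mathbb{R}^d)^n \to \mathbb{R}$ such that $\theta \in \mathbb{R}^D \mapsto g(\theta, X)$ is continuous a.s.,
$\sup_{\theta \in \Theta} g(\theta; X) = \sup_{\theta \in \Theta_{count}} g(\theta; X)$ a.s. Hence,
$\mathbb{E}_{f^{\wp}}[\sup_{\theta \in \Theta} g(\theta; X)] = \mathbb{E}_{f^{\wp}}[\sup_{\theta \in \Theta_{count}} g(\theta; X)]$. Remark however that the models which are
actually considered are discrete, because of computational limitations.

Let us introduce the centered empirical process

$$S_n\gamma(\theta) = n\gamma_n(\theta) - n\,\mathbb{E}_{f^{\wp}}[\gamma(\theta; X)].$$

Here and hereafter, $\alpha$ stands for a generic absolute constant, which may
differ from one line to another. Let $\theta^0 \in \Theta$ such that
$\mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] = \inf_{\theta \in \Theta} \mathbb{E}_{f^{\wp}}[\gamma(\theta)]$.

Let us define

$$\forall \sigma > 0,\quad \Theta(\sigma) = \{\theta \in \Theta : \|\theta - \theta^0\|_\infty \le \sigma\}.$$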
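On the one hand, for all $r \in \mathbb{N}^* \setminus \{1\}$,

$$\forall \theta \in \Theta(\sigma),\quad |\gamma(\theta^0; x) - \gamma(\theta; x)|^r \le M'(x)^2\, \|\theta^0 - \theta\|_\infty^2\, (2M(x))^{r-2}$$

since $\Theta(\sigma) \subset \Theta$ is convex. And thus,

$$\forall \theta \in \Theta(\sigma),\quad \mathbb{E}_{f^{\wp}}\big[|\gamma(\theta^0) - \gamma(\theta)|^r\big] \le \|M'\|_2^2\, \|\theta^0 - \theta\|_\infty^2\, (2\|M\|_\infty)^{r-2}.$$

On the other hand, from Lemma 3.3, for any $r \in \mathbb{N}^* \setminus \{1\}$, for any $\delta > 0$,
there exists $C_\delta$ a set of brackets which cover $\{(\gamma(\theta^0) - \gamma(\theta)) : \theta \in \Theta(\sigma)\}$
(deduced from a set of brackets which cover $\{\gamma(\theta) : \theta \in \Theta(\sigma)\}$) such that:

$$\forall r \in \mathbb{N}^* \setminus \{1\},\ \forall [g_l, g_u] \in C_\delta,\quad \|g_u - g_l\|_r \le \Big(\frac{r!}{2}\Big)^{\frac{1}{r}} \delta^{\frac{2}{r}} \Big(\frac{4\|M\|_\infty}{3}\Big)^{\frac{r-2}{r}}$$

and such that, writing $e^{H(\delta, \Theta(\sigma))}$ the minimal cardinal of such a $C_\delta$,

$$e^{H(\delta, \Theta(\sigma))} \le \left(\frac{\mathrm{diam}\,\Theta(\sigma)\ \|M'\|_2}{\delta}\right)^{D} \vee 1. \qquad (6)$$

Then, according to Theorem 6.8 in Massart (2007),
$\exists \alpha, \forall \varepsilon \in\, ]0, 1], \forall A$ measurable such that $\mathbb{P}[A] > 0$,

$$\mathbb{E}^A\Big[\sup_{\theta \in \Theta(\sigma)} S_n\big(\gamma(\theta^0) - \gamma(\theta)\big)\Big] \le \frac{\alpha}{\varepsilon}\sqrt{n} \int_0^{\varepsilon\|M'\|_2\sigma} \sqrt{H(u, \Theta(\sigma))}\, du + 2\Big(\frac{4}{3}\|M\|_\infty + \|M'\|_2\sigma\Big) H\big(\|M'\|_2\sigma, \Theta(\sigma)\big) + (1 + 6\varepsilon)\|M'\|_2\,\sigma\sqrt{2n \log\frac{1}{\mathbb{P}[A]}} + \frac{8}{3}\|M\|_\infty \log\frac{1}{\mathbb{P}[A]}. \qquad (7)$$

Now, we have

$$\forall t \in \mathbb{R}^+,\quad \int_0^t \sqrt{\log\frac{1}{u} \vee 0}\ du = \int_0^{t \wedge 1} \sqrt{\log\frac{1}{u}}\ du \le \sqrt{t \wedge 1}\,\sqrt{\int_0^{t \wedge 1} \log\frac{1}{u}\ du} = (t \wedge 1)\sqrt{\log\frac{e}{t \wedge 1}}$$

by the Cauchy-Schwarz inequality. Together with (6), this yields

$$\forall t \in \mathbb{R}^+,\quad \int_0^t \sqrt{H(u, \Theta(\sigma))}\ du \le \sqrt{D} \int_0^t \sqrt{\log\frac{2\|M'\|_2\sigma}{u} \vee 0}\ du \le \sqrt{D}\,\big(t \wedge 2\|M'\|_2\sigma\big)\sqrt{\log\frac{e}{\frac{t}{2\|M'\|_2\sigma} \wedge 1}} \qquad (8)$$

after a simple substitution.

Next, let us apply Lemma 4.23 in Massart (2007): from (6), (7) and (8),

$$\forall \sigma > 0,\quad \mathbb{E}^A\Big[\sup_{\theta \in \Theta(\sigma)} S_n\big(\gamma(\theta^0) - \gamma(\theta)\big)\Big] \le \varphi(\sigma)$$

with

$$\varphi(t) = \frac{\alpha}{\varepsilon}\sqrt{nD}\ \varepsilon\,\|M'\|_2\, t\,\sqrt{\log\frac{2e}{\varepsilon}} + 2\Big(\frac{4}{3}\|M\|_\infty + \|M'\|_2\, t\Big) D \log 2 + (1 + 6\varepsilon)\|M'\|_2\, t\,\sqrt{2n \log\frac{1}{\mathbb{P}[A]}} + \frac{8}{3}\|M\|_\infty \log\frac{1}{\mathbb{P}[A]}.$$

As required for Lemma 4.23 in Massart (2007) to hold, $\frac{\varphi(t)}{t}$ is nonincreasing.
It follows

$$\forall \beta > 0,\quad \mathbb{E}^A\left[\sup_{\theta \in \Theta} \frac{S_n\big(\gamma(\theta^0) - \gamma(\theta)\big)}{\|\theta - \theta^0\|_\infty^2 + \beta^2}\right] \le 4\beta^{-2}\varphi(\beta).$$

We then choose $\varepsilon = 1$ and apply Lemma 2.4 in Massart (2007): for any
$\eta > 0$ and any $\beta > 0$, with probability larger than $1 - e^{-\eta}$,

$$\sup_{\theta \in \Theta} \frac{S_n\big(\gamma(\theta^0) - \gamma(\theta)\big)}{\|\theta - \theta^0\|_\infty^2 + \beta^2} \le \frac{\alpha}{\beta^2}\Big(\|M'\|_2\,\beta\sqrt{nD} + \big(\|M\|_\infty + \|M'\|_2\,\beta\big) D \log 2 + \|M'\|_2\,\beta\sqrt{n\eta} + \|M\|_\infty\,\eta\Big).$$ □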

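Proof of Corollary 4.1. Let $\varepsilon > 0$ such that $B(\theta^0, \varepsilon) \subset \Theta^{O}$. Then, since
$\hat\theta_n \xrightarrow{\mathbb{P}} \theta^0$, there exists $n_0 \in \mathbb{N}^*$ such that, with large probability, for $n \ge n_0$,
$\hat\theta_n \in B(\theta^0, \varepsilon)$. Now, $B(\theta^0, \varepsilon)$ is convex and the assumptions of the corollary
guarantee that Lemma 4.2 applies. Let us apply it to $\hat\theta_n$: $\forall n \ge n_0$, $\forall \beta > 0$,
with great probability as $\eta$ is large,

$$\frac{S_n\big(\gamma(\theta^0) - \gamma(\hat\theta_n)\big)}{\|\theta^0 - \hat\theta_n\|_\infty^2 + \beta^2} \le \frac{\alpha}{\beta^2}\Big(\|M'\|_2\,\beta\sqrt{nD} + \big(\|M\|_\infty + \|M'\|_2\,\beta\big) D + \|M'\|_2\,\beta\sqrt{n\eta} + \|M\|_\infty\,\eta\Big). \qquad (9)$$

But since $I_{\theta^0}$ is supposed to be nonsingular, $\forall \theta \in B(\theta^0, \varepsilon)$,

$$\mathbb{E}_{f^{\wp}}[\gamma(\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] = (\theta - \theta^0)' I_{\theta^0} (\theta - \theta^0) + r(\|\theta - \theta^0\|_\infty)\|\theta - \theta^0\|_\infty^2 \ge \big(2\alpha' + r(\|\theta - \theta^0\|_\infty)\big)\|\theta - \theta^0\|_\infty^2$$

where $\alpha' > 0$ depends on $I_{\theta^0}$ and $r$ fulfills $r(x) \xrightarrow[x \to 0]{} 0$. Then,
for $\|\theta - \theta^0\|_\infty$ small enough ($\varepsilon$ may be decreased),

$$\forall \theta \in B(\theta^0, \varepsilon),\quad \mathbb{E}_{f^{\wp}}[\gamma(\theta)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta^0)] \ge \alpha'\|\theta - \theta^0\|_\infty^2. \qquad (10)$$

Since

$$S_n\big(\gamma(\theta^0) - \gamma(\hat\theta_n)\big) = n\big(\gamma_n(\theta^0) - \gamma_n(\hat\theta_n)\big) + n\,\mathbb{E}_{f^{\wp}}\big[\gamma(\hat\theta_n) - \gamma(\theta^0)\big] \ge O_{\mathbb{P}}(1) + n\,\mathbb{E}_{f^{\wp}}\big[\gamma(\hat\theta_n) - \gamma(\theta^0)\big],$$

(9) together with (10) leads (with great probability) to

$$n\|\hat\theta_n - \theta^0\|_\infty^2 \le \frac{\alpha\Big(\|M'\|_2\big(\sqrt{nD} + \sqrt{n\eta} + D\big)\beta + \|M\|_\infty(D + \eta)\Big) + O_{\mathbb{P}}(1)}{\alpha' - \frac{\alpha}{n\beta^2}\Big(\|M'\|_2\big(\sqrt{nD} + \sqrt{n\eta} + D\big)\beta + \|M\|_\infty(D + \eta)\Big)}$$

as soon as the denominator of the right-hand side is positive. It then suffices
to choose $\beta$ such that this condition is fulfilled and such that the right-hand
side is upper-bounded by a quantity which does not depend on $n$ to get the
result. Let us try $\beta = \frac{\beta_0}{\sqrt{n}}$ with $\beta_0$ independent of $n$:

$$n\|\hat\theta_n - \theta^0\|_\infty^2 \le \frac{\alpha\Big(\|M'\|_2\big(\sqrt{D} + \sqrt{\eta} + D\big)\beta_0 + \|M\|_\infty(D + \eta)\Big) + O_{\mathbb{P}}(1)}{\alpha' - \frac{\alpha}{\beta_0^2}\Big(\|M'\|_2\big(\sqrt{D} + \sqrt{\eta} + D\big)\beta_0 + \|M\|_\infty(D + \eta)\Big)}.$$

This only holds if the denominator is positive. Choose $\beta_0$ large enough so
as to guarantee this, which is always possible. The result follows: with large
probability and for $n$ larger than $n_0$, we have $n\|\hat\theta_n - \theta^0\|_\infty^2 = C\,O_{\mathbb{P}}(1)$ with
$C$ depending on $D$, $\|M\|_\infty$, $\|M'\|_2$, $I_{\theta^0}$ and $\eta$. □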

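Proof of Corollary 4.2. This is a direct application of Corollary 4.1. Let
$K \in \mathcal{K}$: $\mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)] = \mathbb{E}_{f^{\wp}}[\gamma(\theta_{K_0}^0)]$. $\Theta_K$ can be assumed to be convex: if
it is not, $\hat\theta_K$ lies in $B(\theta_K^0, \varepsilon) \subset \Theta^{O}$ with large probability for large $n$ and $\Theta_K$
may be replaced by $B(\theta_K^0, \varepsilon)$. According to Lemma 4.2, with probability
larger than $(1 - e^{-\eta})$ for $n$ large, with $\beta = \frac{\beta_0}{\sqrt{n}}$ for any $\beta_0 > 0$:

$$S_n\big(\gamma(\theta_K^0) - \gamma(\hat\theta_K)\big) \le \alpha\,\frac{n\|\theta_K^0 - \hat\theta_K\|_\infty^2 + \beta_0^2}{\beta_0^2}\Big(\|M'\|_2\big(\sqrt{D_K} + \sqrt{\eta} + D_K\big)\beta_0 + \|M\|_\infty(D_K + \eta)\Big).$$

But, according to Corollary 4.1, $n\|\theta_K^0 - \hat\theta_K\|_\infty^2 = O_{\mathbb{P}}(1)$. Moreover, by
definition,

$$S_n\big(\gamma(\theta_K^0) - \gamma(\hat\theta_K)\big) = n\big(\gamma_n(\theta_K^0) - \gamma_n(\hat\theta_K)\big) + n\underbrace{\big(\mathbb{E}_{f^{\wp}}[\gamma(\hat\theta_K)] - \mathbb{E}_{f^{\wp}}[\gamma(\theta_K^0)]\big)}_{\ge 0}.$$

Thus, $n\big(\gamma_n(\theta_K^0) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1)$. This holds for any $K \in \mathcal{K}$ and then in
particular for $K_0$ and $K$. Besides, $\gamma_n(\theta_K^0) = \gamma_n(\theta_{K_0}^0)$ since, by assumption,
$\gamma(\theta_K^0) = \gamma(\theta_{K_0}^0)$ $f^{\wp}d\lambda$-a.e. Hence $n\big(\gamma_n(\hat\theta_{K_0}) - \gamma_n(\hat\theta_K)\big) = O_{\mathbb{P}}(1)$. □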